3.0 Scripts, R Markdown, and reproducible research

3.1 Scripts in R

Just choose File > New File > New script and a script window will open up in the upper left of the full RStudio window.

u <- "http://blue.for.msu.edu/FOR875/data/WorldBank.csv"
WorldBank <- read.csv(u, header = TRUE, stringsAsFactors = FALSE)
names(WorldBank)
##  [1] "iso2c"                        "country"                     
##  [3] "year"                         "fertility.rate"              
##  [5] "life.expectancy"              "population"                  
##  [7] "GDP.per.capita.Current.USD"   "X15.to.25.yr.female.literacy"
##  [9] "iso3c"                        "region"                      
## [11] "capital"                      "longitude"                   
## [13] "latitude"                     "income"                      
## [15] "lending"

We will try to create a scatter plot of fertility rate versus life expectancy of countries for the year 1960.

fertility <- WorldBank$fertility.rate[WorldBank$year == 1960]
lifeexp <- WorldBank$life.expectancy[WorldBank$year == 1960]
fertility[1:10]
##  [1]    NA 6.928 7.671 4.425 6.186 4.550 7.316 3.109    NA 2.690
plot(lifeexp, fertility)

pop <- WorldBank$population[WorldBank$year == 1960]
region <- WorldBank$region[WorldBank$year == 1960]

First we will create the axes, etc. for the plot, but not plot the points, type="n" tells R to do this.

plot(lifeexp, fertility, type="n")

symbols(lifeexp, fertility, circles=sqrt(pop/pi), inches=0.35, bg=match(region, unique(region)))

3.2 R Markdown

R Markdown provides a way to include R code that read in data, create graphics, or perform analyses, in a single document which is processed to create a research paper or homework assignment or other written product.

3.3 Basic formatting:

italic bold > block quote

strikethrough A code chunk:

x <- 1:10
y <- 10:1
mean(x)
## [1] 5.5
sd(y)
## [1] 3.02765

Inline code: 10 Inline code not executed: 5+5

3.31 Text: Lists and Headers

For an unordered list, either an asterisk, a plus sign, or a minus sign may precede list items.

For an ordered list use a numeral followed by a period and a space (1. or 2. or 3. or …) For an ordered list, the first list item will be labeled with the number or letter that you specify, but subsequent list items will be numbered sequentially.

An unordered list:

  • List item 1
  • List item 2
    • Second level list item 1
    • Second level list item 2
      • Third level list item
  • List item 3

An ordered list:

  1. List item 1
  2. List item 2
    1. Sub list item 1
    2. Sub list item 2
  3. List item 3

use # to indicate headers. # A first *level* ~~header~~ ## A second level header

Text subscripts and superscripts: x2 + y2 103 = 1000 Mathematics examples: \(x_a\)
\(x^a\)

3.4 Code Chunks

echo=FALSE specifies that the R code should not be printed (but any output of the R code should be printed) in the resulting document.

  1. include=FALSE specifies that neither the R code nor the output should be printed. However, the objects created by the code chunk will be available for use in later code chunks.

  2. eval=FALSE specifies that the R code should not be evaluated. The code will be printed unless, for example, echo=FALSE is also given as an option.
  3. error=FALSE and warning=FALSE specify that, respectively, error messages and warning messages generated by the R code should not be printed.

  4. The comment option allows a specified character string to be prepended to each line of results. By default this is set to comment = ## which explains the two hash marks preceding results in Figure 3.3 for example. Setting comment = NA presents output without any character string prepended. That is done in most code chunks in this book.

  5. prompt=TRUE specifies that R prompt > will be prepended to each line of R code shown in the document. prompt = FALSE specifies that command prompts should not be included.

  6. fig.height and fig.width specify the height and width of figures generated by R code. These are specified in inches, so for example fig.height=4 specifies a four inch high figure.

  7. code Set to R code. Knitr will replace the code in the chunk with the code in the code option.

  8. engine ‘R’ Knitr will evaluate the chunk in the named language, e.g. engine = 'python'.

  9. collapse If TRUE, knitr will collapse all the source and output blocks created by the chunk into a single block.

  10. results ‘markup’ If 'hide', knitr will not display the code’s results in the final document. If 'hold', knitr will delay displaying all output pieces until the end of the chunk. If 'asis', knitr will pass through results without reformatting them (useful if results return raw HTML, etc.)

  11. error If FALSE, knitr will not display any error messages generated by the code.

  12. message If FALSE, knitr will not display any messages generated by the code.

  13. warning If FALSE, knitr will not display any warning messages generated by the code.

  14. strip.white If TRUE, knitr will remove white spaces that appear at the beginning or end of a code chunk.

  15. autodep If TRUE, knitr will attempt to figure out dependencies between chunks automatically by analyzing object names.

  16. cache If TRUE, knitr will cache the results to reuse in future knits. Knitr will reuse the results until the code chunk is altered.

  17. cache.comments NULL If FALSE, knitr will not rerun the chunk if only a code comment has changed.

4.0 Data Structures

4.1 Vectors

Think of a vector as a structure to represent one variable in a data set. For example

Vectors in R can only contain elements of one type. a vector might hold the weights, in pounds, of 7 people in a data set.

weight <- c(123, 157, 205, 199, 223, 140, 105)
weight
## [1] 123 157 205 199 223 140 105
gender <- c("female", "female", "male", "female", "male", "male", "female")
gender
## [1] "female" "female" "male"   "female" "male"   "male"   "female"

Notice that elements of a vector are separated by commas when using the c() function to create a vector.

The c() function also can be used to add to an existing vector. add at the last element.

weight <- c(weight, 194)
weight
## [1] 123 157 205 199 223 140 105 194
gender <- c(gender, "male")

gender
## [1] "female" "female" "male"   "female" "male"   "male"   "female" "male"

4.11 Types, conversion, coercion

typeof(weight)
## [1] "double"
typeof(gender)
## [1] "character"
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
bp
## [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
typeof(bp)
## [1] "logical"
as.character(bp)
## [1] "FALSE" "TRUE"  "FALSE" "FALSE" "TRUE"  "FALSE" "TRUE"  "FALSE"
as.numeric(bp)
## [1] 0 1 0 0 1 0 1 0
is.character(bp)
## [1] FALSE
char_bp = as.character(bp)
is.character(char_bp)
## [1] TRUE
weight.int <- as.integer(weight)
weight.int
## [1] 123 157 205 199 223 140 105 194
weight.char <- as.character(weight)
weight.char
## [1] "123" "157" "205" "199" "223" "140" "105" "194"
bp.double <- as.double(bp)
bp.double
## [1] 0 1 0 0 1 0 1 0

4.12 Coercion

Values are converted to the simplest type to represent all the information Character > complex > double > integer > logical

c(4 + 3i, TRUE, "red", 6) # will become character
## [1] "4+3i" "TRUE" "red"  "6"
c(4 + 3i, TRUE, 4/5, 6) # will become complex
## [1] 4.0+3i 1.0+0i 0.8+0i 6.0+0i
c(4 + 3, TRUE, 4/5, 6.0) # will become double
## [1] 7.0 1.0 0.8 6.0
c(4 + 3, TRUE, 6) # will become Integer
## [1] 7 1 6
yy <- c(1, 2, 3, "dog")
yy
## [1] "1"   "2"   "3"   "dog"

4.13 Accessing specific elements of vectors

weight[5]
## [1] 223
weight[1:3]
## [1] 123 157 205
length(weight)
## [1] 8
weight[length(weight)]
## [1] 194
weight[3] <- 200000
weight
## [1]    123    157 200000    199    223    140    105    194

Negative numbers in the square brackets tell R to omit the corresponding value.

weight[-3]
## [1] 123 157 199 223 140 105 194
lessWeight <- weight[-c(1, 3, 5)]
lessWeight
## [1] 157 199 140 105 194
weight
## [1]    123    157 200000    199    223    140    105    194

4.2 Factors

Categorical variables such as gender can be represented as character vectors. In many cases this simple representation is sufficient. Factors in R provide a more sophisticated way to represent categorical variables.

age <- c("middle age", "senior", "middle age", "senior", "senior", "senior", "senior", "middle age")
age
## [1] "middle age" "senior"     "middle age" "senior"     "senior"    
## [6] "senior"     "senior"     "middle age"
income <- c("lower", "lower", "upper", "middle", "upper", "lower", "lower", "middle")
income
## [1] "lower"  "lower"  "upper"  "middle" "upper"  "lower"  "lower"  "middle"
age <- factor(age, levels=c("youth", "young adult", "middle age", "senior"))
age
## [1] middle age senior     middle age senior     senior     senior    
## [7] senior     middle age
## Levels: youth young adult middle age senior
income <- factor(income, levels=c("lower", "middle", "upper"),ordered = TRUE)
income
## [1] lower  lower  upper  middle upper  lower  lower  middle
## Levels: lower < middle < upper

4.14 Vector Recycling

x = c(2,3,4)
y = c(5,6,7)
y + x
## [1]  7  9 11
x_small = c(1,2,3)
y_big = c(1,2,3,4,5,6)
 
x_small + y_big
## [1] 2 4 6 5 7 9
x <- weight
x <- c(x, NA, NA)
sum(is.na(x))
## [1] 2
x[100]
## [1] NA
x = 1:10
mean(na.rm=FALSE, x, trim=0)
## [1] 5.5
y <- mpg[, 2]
dim(mpg)
## [1] 234  11
z <- rep(0, length = dim(mpg)[2])
for (i in 1:dim(mpg)[2]){
    z[i] <- sum(mpg[, i] < 5)}
z
##  [1]   0   6 196   0  81   0 103   0   0   0   5
a <- c(3, 6, 9, 12)
b <- matrix(c(1:4), nrow = 2, ncol = 2)
c <- c("good", "bad", "indifferent")
my.list <- list(a, b, c)
my.list
## [[1]]
## [1]  3  6  9 12
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
## [1] "good"        "bad"         "indifferent"
 my.list[[1]][3]
## [1] 9
  my.list[[2]][2, ]
## [1] 2 4

4.3 Missing Data, Infinity,

missingCharacter <- c("dog", "cat", NA, "pig", NA, "horse")
missingCharacter
## [1] "dog"   "cat"   NA      "pig"   NA      "horse"
is.na(missingCharacter)
## [1] FALSE FALSE  TRUE FALSE  TRUE FALSE

remove the missing value(s) and then perform the computation.

mean(c(1, 2, 3, NA, 5), na.rm = TRUE)
## [1] 2.75

1. NaN` represents the result of a calculation where the result is undefined, such as dividing zero by zero.

2. ### Inf and -Inf represent infinity and negative infinity (and numbers which are too large in magnitude to be represented as floating point numbers).

4.5 Data Frames

data is rectangular in form, with variables as columns and cases as rows.

gender <- c("female", "female", "male", "female", "male", "male", "female")
bp <- c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
weight <- c(123, 157, 205, 199, 223, 140, 105)

healthData <- data.frame(Weight = weight, Gender=gender, bp.meds = bp,  stringsAsFactors=FALSE)

healthData
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE
names(healthData)
## [1] "Weight"  "Gender"  "bp.meds"
colnames(healthData)
## [1] "Weight"  "Gender"  "bp.meds"
names(healthData) <- c("Wt", "Gdr", "bp")
healthData
##    Wt    Gdr    bp
## 1 123 female FALSE
## 2 157 female  TRUE
## 3 205   male FALSE
## 4 199 female FALSE
## 5 223   male  TRUE
## 6 140   male FALSE
## 7 105 female  TRUE
rownames(healthData)
## [1] "1" "2" "3" "4" "5" "6" "7"
names(healthData) <- c("Weight", "Gender", "bp.meds")
healthData
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE

4.5.1 Accessing specific elements of data frames

Data frames are two-dimensional, so to access a specific element (or elements) we need to specify both the row and column.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars[3, 4]
## [1] 93
row.names(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
mtcars[1:3, 2:3]
##               cyl disp
## Mazda RX4       6  160
## Mazda RX4 Wag   6  160
## Datsun 710      4  108
mtcars["cyl"]
##                     cyl
## Mazda RX4             6
## Mazda RX4 Wag         6
## Datsun 710            4
## Hornet 4 Drive        6
## Hornet Sportabout     8
## Valiant               6
## Duster 360            8
## Merc 240D             4
## Merc 230              4
## Merc 280              6
## Merc 280C             6
## Merc 450SE            8
## Merc 450SL            8
## Merc 450SLC           8
## Cadillac Fleetwood    8
## Lincoln Continental   8
## Chrysler Imperial     8
## Fiat 128              4
## Honda Civic           4
## Toyota Corolla        4
## Toyota Corona         4
## Dodge Challenger      8
## AMC Javelin           8
## Camaro Z28            8
## Pontiac Firebird      8
## Fiat X1-9             4
## Porsche 914-2         4
## Lotus Europa          4
## Ford Pantera L        8
## Ferrari Dino          6
## Maserati Bora         8
## Volvo 142E            4

Note that mtcars[,1] returns ALL elements in the first column.

mtcars[, 1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

For a data frame there is another way to access specific columns, using the $ notation.

mtcars$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4

4.6 Lists


Technically a list is a vector, but one in which elements can be of different types.


Key Properties:

  • Heterogenous
  • Index by position/name my_list[[2]]
    • my_list[2] will return 2 element as list
  • Index with multiple position and names
  • Have names

The lm function returns a list

mpgHpLinMod <- lm(mpg ~ hp, data = mtcars)
mode(mpgHpLinMod)
## [1] "list"
names(mpgHpLinMod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
mpgHpLinMod
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp  
##    30.09886     -0.06823
mpgHpLinMod$coefficients
## (Intercept)          hp 
## 30.09886054 -0.06822828

The list function is used to create lists.

temporaryList <- list(first=weight, second=healthData, pickle=list(a = 1:10, b=healthData))

temporaryList$first
## [1] 123 157 205 199 223 140 105
temporaryList[[1]]
## [1] 123 157 205 199 223 140 105
temporaryList[[1]][1]
## [1] 123
temporaryList$second
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE
temporaryList[c(1, 2)] # returns a list with 1st and 2nd elements
## $first
## [1] 123 157 205 199 223 140 105
## 
## $second
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE
temporaryList[[c(1, 2)]]
## [1] 157
temporaryList[2]
## $second
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE
my_list = list(1, 2, 3, c(3, 5, 6), c("a", "b", "c"))
my_list[1]
## [[1]]
## [1] 1
my_list[[4]][1]
## [1] 3
my_list[[c(4, 2)]]
## [1] 5

4.7 Subsetting with logical vectors

weight
## [1] 123 157 205 199 223 140 105
gender
## [1] "female" "female" "male"   "female" "male"   "male"   "female"
gender[weight > 200]
## [1] "male" "male"

4.7.1 Modifying or creating objects via subsetting

lightweight <- weight[weight < 200]
lightweight
## [1] 123 157 199 140 105
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x[x < 5] <- 0
x
##  [1]  0  0  0  0  5  6  7  8  9 10

4.7.2 Logical subsetting and data frames

Example:

x = c(-3:3)
x
## [1] -3 -2 -1  0  1  2  3
z <- c(T, F, F, T, F, F, F)
x[z]
## [1] -3  0

** We can use <, >, ==, >=, <= != to create conditions with & and I **

healthData
##   Weight Gender bp.meds
## 1    123 female   FALSE
## 2    157 female    TRUE
## 3    205   male   FALSE
## 4    199 female   FALSE
## 5    223   male    TRUE
## 6    140   male   FALSE
## 7    105 female    TRUE
healthData$Weight[healthData$Gender == "male"]
## [1] 205 223 140
healthData[healthData$Weight > 190, 2:3]
##   Gender bp.meds
## 3   male   FALSE
## 4 female   FALSE
## 5   male    TRUE
temporaryDataFrame <- data.frame(V1 = c(1, 2, 3, 4, NA), V2 = c(NA, 1, 4, 5, NA), V3 = c(1, 2, 3, 5, 7))
temporaryDataFrame
##   V1 V2 V3
## 1  1 NA  1
## 2  2  1  2
## 3  3  4  3
## 4  4  5  5
## 5 NA NA  7
is.na(temporaryDataFrame)
##         V1    V2    V3
## [1,] FALSE  TRUE FALSE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE
## [5,]  TRUE  TRUE FALSE
rowSums(is.na(temporaryDataFrame))
## [1] 1 0 0 0 2
colSums(is.na(temporaryDataFrame))
## V1 V2 V3 
##  1  2  0

4.8 Patterned data

The seq() function generates either a sequence of pre-specified length or a sequence with pre-specified increments.

seq(from = 0, to = 1, length = 11)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(from = 1, to = 5, by = 1/3)
##  [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000
##  [8] 3.333333 3.666667 4.000000 4.333333 4.666667 5.000000

The rep() function replicates the values in a given vector.

rep(c(1, 2, 4), length = 9)
## [1] 1 2 4 1 2 4 1 2 4
rep(c(1, 2, 4), times = 3)
## [1] 1 2 4 1 2 4 1 2 4
rep(c("a", "b", "c"), times = c(3, 2, 7))
##  [1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "c" "c" "c"

Functions and Programming

Learning about R’s programming capabilities is an important step in gaining facility with functions.

7.1 R functions

u.corn <- "http://blue.for.msu.edu/FOR875/data/corn.csv"
corn <- read.csv(u.corn, header = TRUE)
corn
##    regular kiln_dried
## 1     1903       2009
## 2     1935       1915
## 3     1910       2011
## 4     2496       2463
## 5     2108       2180
## 6     1961       1925
## 7     2060       2122
## 8     1444       1482
## 9     1612       1542
## 10    1316       1443
## 11    1511       1535
paired_t <- function(x1, x2) {
  n <- length(x1)
  dbar <- mean(x1 - x2)
  s_d <- sd(x1 - x2)
  tstat <- dbar/(s_d/sqrt(n))
  pval <- 2 * (1 - pt(abs(tstat), n - 1))
  margin <- qt(0.975, n - 1) * s_d/sqrt(n)
  lcl <- dbar - margin
  ucl <- dbar + margin
  return(list(tstat = tstat, pval = pval, lcl = lcl, ucl = ucl))
}

paired_t(x1 = corn$kiln_dried, x2 = corn$regular)
## $tstat
## [1] 1.690476
## 
## $pval
## [1] 0.1218166
## 
## $lcl
## [1] -10.7271
## 
## $ucl
## [1] 78.18164

7.1.1 Creating functions

A function can be written in any text editor, saved as a plain text file (possibly with a .r extension), and then read into R using the source() function.

7.2 Programming: Conditional Statements

The argument to the if() function is evaluated. If the argument returns TRUE the ensuing code is executed.

7.2.1 Comparison and logical operators

c(FALSE, TRUE, FALSE) || c(TRUE, FALSE, FALSE)
## [1] TRUE
c(FALSE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## [1]  TRUE  TRUE FALSE
c(FALSE, TRUE, FALSE) && c(TRUE, TRUE, FALSE)
## [1] FALSE
c(FALSE, TRUE, FALSE) & c(TRUE, TRUE, FALSE)
## [1] FALSE  TRUE FALSE

7.2.2 If else statements

Sign <- function(x) {
  if (x < 0) {
     print("the number is negative")
  }else if (x > 0) {
    print("the number is positive")
  }else {
    print("the number is zero")
  }
}
Sign(3)
## [1] "the number is positive"
Sign(-3)
## [1] "the number is negative"
Sign(0)
## [1] "the number is zero"

7.3 Computer Arithmetic

R does not perform exact arithmetic

2^-30
## [1] 9.313226e-10
2^-30 + (2^30 - 2^30)
## [1] 9.313226e-10
(2^-30 + 2^30) - 2^30
## [1] 0
1.5 - 1.4 == 0.1
## [1] FALSE
all.equal((1.5 - 1.4), 0.1)
## [1] TRUE

7.4 Loops

Loops are an important component of any programming language, including R. Vectorized calculations and functions such as apply() make loops a bit less central to R than to many other languages.

7.4.1 A repeat loop

A repeat loop just repeats a given expression over and over again until a break statement is encountered.

k <- 1
repeat {
  if (1 == 1 + 1/2^k){
    break
  }else {
    k <- k + 1
  }
}
k
## [1] 53

7.4.2 A while loop

A while loop has the form while (condition) expression As long as the condition is TRUE the expression is evaluated. Once the condition is FALSE control is transferred outside the loop.

k <- 1
while (1 != 1 + 1/2^k) {
  k <- k + 1
}
k
## [1] 53

7.4.3 A for loop

A for loop has the form for (variable in vector) expression The for loop sets the variable equal to each element of the vector in succession, and evaluates the expression each time

x <- 1:10
S <- 0

for (i in 1:length(x)){
  S <- S + x[i]
}
S
## [1] 55

5.0 ggplot2

  1. gg stands for grammar of graphics
  2. ggigraph is an interactive graph
  3. ggrepel overlapping text labels
  4. ggraph graphs
  5. ggthemes themes
Template

ggplot ( data =<dataframe>, mapping = aes(<Mappings1>)) + <Geom_fen>(aes(<Mappings2>)) + <Facet_function> facet_wrap(~var) facet_grid(~var1, var2) + labs (x, y, title, caption, subtitle, color, size)

Examples
data("mtcars")
ggplot(data = mtcars,
       mapping = aes(x=hp, y=mpg)) +
  geom_point(aes(color = factor(cyl),
                 shape = factor(am)),
             size=5)

data("mtcars")
ggplot(data = mtcars,
       mapping = aes(x=hp, y=mpg)) +
  geom_point(aes(color = factor(cyl),
                 shape = factor(am)),
             size=5) +
  facet_wrap(~factor(cyl))

5.1 Scatter Plots

Scatter plots are a workhorse of data visualization and provide a good entry point to the ggplot2 system.

data(iris)
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
library(ggplot2)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
       geom_point()

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
       geom_point(size = 4, aes(color=Species, shape=Species))

5.1.2 Adding lines to a scatter plot

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
       geom_point(size=3, aes(color=Species)) +
       stat_smooth(method = lm, se=FALSE)

For the iris data, it probably makes more sense to fit separate lines by species. This can be specified using the aes() function inside stat_smooth().

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(size=3, aes(color=Species)) +
      stat_smooth(method = lm, se=FALSE, aes(color=Species))

Another common use of line segments in a graphic is to connect the points in order, accomplished via the geom_line() function.

ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(size = 4, aes(color=Species, shape = Species)) +
      geom_line()

      ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
      geom_point(size = 4, aes(color=Species)) +
      geom_line(aes(color=Species))

5.2.0 Labels, axes, text, etc.

u.crime <- "http://blue.for.msu.edu/FOR875/data/crimeRatesByState2005.csv"
crime <- read.csv(u.crime, header=TRUE)
str(crime)
## 'data.frame':    50 obs. of  9 variables:
##  $ state              : Factor w/ 50 levels "Alabama ","Alaska ",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ murder             : num  8.2 4.8 7.5 6.7 6.9 3.7 2.9 4.4 5 6.2 ...
##  $ Forcible_rate      : num  34.3 81.1 33.8 42.9 26 43.4 20 44.7 37.1 23.6 ...
##  $ Robbery            : num  141.4 80.9 144.4 91.1 176.1 ...
##  $ aggravated_assult  : num  248 465 327 387 317 ...
##  $ burglary           : num  954 622 948 1085 693 ...
##  $ larceny_theft      : num  2650 2599 2965 2711 1916 ...
##  $ motor_vehicle_theft: num  288 391 924 262 713 ...
##  $ population         : int  4627851 686293 6500180 2855390 36756666 4861515 3501252 873092 18328340 9685744 ...

Here we use labs() to change the x and y axis labels and other descriptive text.

ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft)) +
  geom_point() +
  labs(x = "Burglaries per 100,000 population",
       y = "Motor vehicle theft per 100,000 population",
       title = "Burglaries vs motor vehicle theft for US states",
       subtitle = "2005 crime rates and 2008 population",
       caption = "Data from Nathan Yau http://flowingdata.com"
)

5.2.1 Customizing axes

ggplot also provides default axis extents (i.e., limits) and other axis features. These, and other axis features such as tick marks, labels, and transformations, can be changed using the scale functions

ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft)) +
  geom_point() +
  scale_x_continuous(name="Burglaries per 100,000 population",
                    limits=c(0,max(crime$burglary))) +
  scale_y_continuous(name="Motor vehicle theft per 100,000 population",
                      limits = c(0, max(crime$motor_vehicle_theft)))

5.2.2 Text, point size, and color

ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft,
       size=population/100000)) +
       geom_point(color = "blue") +
       geom_label(aes(label = state), alpha = 0.5) +
       scale_x_continuous(name="Burglaries per 100,000 population",
                     limits=c(0,max(crime$burglary))) +
        scale_y_continuous(name="Motor vehicle theft per 100,000 population",
                     limits = c(0, max(crime$motor_vehicle_theft))) +
        labs(size="Populationnn(100,000)")

library(ggrepel)
ggplot(data = crime, aes(x = burglary, y = motor_vehicle_theft,
      size=population/100000)) +
      geom_point(color = "blue") +
      scale_x_continuous(name="Burglaries per 100,000 population",
                    limits=c(0,max(crime$burglary))) +
      scale_y_continuous(name="Motor vehicle theft per 100,000 population",
                  limits = c(0, max(crime$motor_vehicle_theft))) +
      labs(size="Populationnn(100,000)") +
       ggrepel::geom_label_repel(aes(label = state), alpha = 0.5)

geom_raster: heatmap (x,y) geom_density: density plot

5.3.1 Histograms

Simon Newcomb conducted several experiments to estimate the speed of light by measuring the time it took for light to travel from his laboratory to a mirror at the base of the Washington Monument, and then back to his lab.

u.newcomb <- "http://blue.for.msu.edu/FOR875/data/Newcomb.csv"
Newcomb <- read.csv(u.newcomb, header = TRUE)
head(Newcomb)
##   Time
## 1   28
## 2   26
## 3   33
## 4   24
## 5   34
## 6  -44
ggplot(Newcomb, aes(x = Time)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Newcomb, aes(x = Time)) +
geom_histogram(binwidth = 5, color = "black", fill = "blue" )

5.3.2 Boxplots

library(gapminder)
ggplot(data = subset(gapminder, year == 2002),
     aes(x = continent, y = gdpPercap)) +
  geom_boxplot(color = "black", fill = "lightblue")

ggplot(data = subset(gapminder, year == 2002),
    aes(x = continent, y = gdpPercap)) +
  geom_boxplot(color = "red", fill = "lightblue") +
  scale_x_discrete(name = "Continent") +
  scale_y_continuous(name = "Per Capita GDP") + coord_flip()

5.3.3 Bar graphs

As part of a study, elementary school students were asked which was more important to them: good grades, popularity, or athletic ability. Here is a brief look at the data.

u.goals <- "http://blue.for.msu.edu/FOR875/data/StudentGoals.csv"
StudentGoals <- read.csv(u.goals, header = TRUE)
head(StudentGoals)
##   Gender Grade Age  Race  Type School   Goals Grades Sports Looks Money
## 1    boy     5  11 White Rural    Elm  Sports      1      2     4     3
## 2    boy     5  10 White Rural    Elm Popular      2      1     4     3
## 3   girl     5  11 White Rural    Elm Popular      4      3     1     2
## 4   girl     5  11 White Rural    Elm Popular      2      3     4     1
## 5   girl     5  10 White Rural    Elm Popular      4      2     1     3
## 6   girl     5  11 White Rural    Elm Popular      4      2     1     3

First a simple bar graph of the most important goal chosen is drawn, followed by a stacked bar graph which also includes the student’s gender, followed by a side by side bar graph which includes the student’s gender.

ggplot(StudentGoals, aes(x = Goals)) + geom_bar()

ggplot(StudentGoals, aes(x = Goals, fill = Gender)) + geom_bar()

ggplot(StudentGoals, aes(x = Goals, fill = Gender)) +
  geom_bar(position = "dodge")

In this example R counted the number of students who had each goal and used these counts as the height of the bars. Sometimes the data contain the bar heights as a variable. For example, we create a bar graph of India’s per capita GDP with separate bars for each year in the data

ggplot(subset(gapminder, country == "India"), aes(x = year, y = gdpPercap)) +
  geom_bar(stat = "identity", color = "black", fill = "steelblue2") +
  ggtitle("India's per-capita GDP")

5.3.4 Graphs of functions

x <- seq(-pi, pi, len = 1000)
sin.data <- data.frame(x = x, y = sin(x))
ggplot(data = sin.data, aes(x = x, y = y)) + geom_line() +
  scale_y_continuous(name = "sin(x)")

5.4 Themes

Default themes include: theme_bw(), theme_classic(), theme_dark(), theme_gray(), theme_light(), theme_linedraw(), theme_minimal(), and theme_void().

ggplot(data = sin.data, aes(x = x, y = y)) + geom_line() +
  scale_y_continuous(name = "sin(x)") +
  theme_classic()

The ggthemes add-on package https://github.com/jrnold/ggthemes by Jefrey Arnold provides a large selection of themes beyond the eight themes that come with ggplot2

5.5 Saving graphics

formats. The ggsave() function will allow you to save your most recent ggplot() to a variety of vector (e.g., “eps”, “ps”, “pdf”, “svg”) or raster (e.g., “jpeg”, “tif”, “png”, “bmp”, “wmf”) formats. The subsequent call to ggsave() saves thesin.data plot to a pdf file called “sin-plot.pdf”.

ggplot(filename = "sin-plot.pdf", device="pdf")

Working with Data

This involves bringing data into R, exporting data from R in a form that is readable by other software, cleaning and reshaping data, and other data manipulation

6.1 Reading data into R

Data come in a dizzying variety of forms. It might be in a proprietary format such as an .xlsx Excel file, a .sav SPSS file, or a .mtw Minitab file.

The read.table() function is used to read these data into an R data frame.

u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.csv"
BrainBody <- read.table(file = u.bb, header = TRUE, sep = ",",
stringsAsFactors = FALSE)
head(BrainBody)
##       body brain            name
## 1     1.35   8.1 Mountain beaver
## 2   465.00 423.0             Cow
## 3    36.33 119.5       Grey wolf
## 4    27.66 115.0            Goat
## 5     1.04   5.5      Guinea pig
## 6 11700.00  50.0     Dipliodocus

The arguments used in this call to read.table() include:

  1. file = u.bb, which tells R the location of the file.
  2. header = TRUE, which tells R the first line of the file gives the names of the variables.
  3. sep = ",", which tells R that a comma separates the fields in the file.
  4. stringsAsFactors = FALSE which tells R not to convert character vectors to factors.

The function read.csv() is the same as read.table() except the default separator is a comma, whereas the default separator for read.table() is whitespace.

The file BrainAndBody.tsv contains the same data, except a tab is used in place of a comma to separate fields.

u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.tsv"
BrainBody <- read.table(file = u.bb, header = TRUE, sep = "\t",
stringsAsFactors = FALSE)
head(BrainBody)
##       body brain            name
## 1     1.35   8.1 Mountain beaver
## 2   465.00 423.0             Cow
## 3    36.33 119.5       Grey wolf
## 4    27.66 115.0            Goat
## 5     1.04   5.5      Guinea pig
## 6 11700.00  50.0     Dipliodocus

A third file, BrainAndBody.txt, contains the same data, but also contains a fewnlines of explanatory text above the names of the variables. It also uses whitespacen rather than a comma or a tab as a separator.

u.bb <- "http://blue.for.msu.edu/FOR875/data/BrainAndBody.txt"
BrainBody3 <- read.table(u.bb, header = TRUE, sep = " ", stringsAsFactors = FALSE, skip = 4)
BrainBody3[1:10,]
##        body  brain            name
## 1      1.35    8.1 Mountain beaver
## 2    465.00  423.0             Cow
## 3     36.33  119.5       Grey wolf
## 4     27.66  115.0            Goat
## 5      1.04    5.5      Guinea pig
## 6  11700.00   50.0     Dipliodocus
## 7   2547.00 4603.0  Asian elephant
## 8    187.10  419.0          Donkey
## 9    521.00  655.0           Horse
## 10    10.00  115.0    Potar monkey

6.1.1 Reading data with missing observations

The read.table() function has an argument na.string which allows the user to specify how missing data is indicated in the source file.

Argument na.string = "" which tells R the file indicates missing data by leaving the appropriate entry blank.

u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014.csv"
WeatherKLAN2014 <- read.csv(u.weather, header=TRUE, stringsAsFactors = FALSE, na.string = "")

WeatherKLAN2014[1:15,]
##        EST Max.TemperatureF Min.TemperatureF        Events
## 1   1/1/14               14                9          Snow
## 2   1/2/14               13               -3          Snow
## 3   1/3/14               13              -11          Snow
## 4   1/4/14               31               13          Snow
## 5   1/5/14               29               16      Fog-Snow
## 6   1/6/14               16              -12      Fog-Snow
## 7   1/7/14                2              -13          Snow
## 8   1/8/14               17               -1          Snow
## 9   1/9/14               21                2          Snow
## 10 1/10/14               39               21 Fog-Rain-Snow
## 11 1/11/14               41               32      Fog-Rain
## 12 1/12/14               39               31          <NA>
## 13 1/13/14               44               34          Rain
## 14 1/14/14               37               26     Rain-Snow
## 15 1/15/14               27               18          Snow

6.2 Summarizing data frames

Some common data tasks include variable summaries such as means or standard deviations, transforming an existing variable, and creating new variables.

6.2.1 Column (and row) summaries

u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header=TRUE, stringsAsFactors = FALSE, na.string = "")
names(WeatherKLAN2014Full)
##  [1] "EST"                       "Max.TemperatureF"         
##  [3] "Mean.TemperatureF"         "Min.TemperatureF"         
##  [5] "Max.Dew.PointF"            "MeanDew.PointF"           
##  [7] "Min.DewpointF"             "Max.Humidity"             
##  [9] "Mean.Humidity"             "Min.Humidity"             
## [11] "Max.Sea.Level.PressureIn"  "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn"  "Max.VisibilityMiles"      
## [15] "Mean.VisibilityMiles"      "Min.VisibilityMiles"      
## [17] "Max.Wind.SpeedMPH"         "Mean.Wind.SpeedMPH"       
## [19] "Max.Gust.SpeedMPH"         "PrecipitationIn"          
## [21] "CloudCover"                "Events"                   
## [23] "WindDirDegrees"

6.2.1.1 Mean

To compute mean() for each column we can:

mean(WeatherKLAN2014Full$Mean.TemperatureF)
## [1] 45.78082
mean(WeatherKLAN2014Full$Min.TemperatureF)
## [1] 36.25479
mean(WeatherKLAN2014Full$Max.TemperatureF)
## [1] 54.83836
## Et Cetera

However, this wastes a lot of time. We can save time by using colMeans() function which computes the mean of each (or specified) columns in a data frame.

colMeans(WeatherKLAN2014Full[, c(2:19, 21, 23)])
##          Max.TemperatureF         Mean.TemperatureF 
##                 54.838356                 45.780822 
##          Min.TemperatureF            Max.Dew.PointF 
##                 36.254795                 41.800000 
##            MeanDew.PointF             Min.DewpointF 
##                 36.394521                 30.156164 
##              Max.Humidity             Mean.Humidity 
##                 88.082192                 70.391781 
##              Min.Humidity  Max.Sea.Level.PressureIn 
##                 52.200000                 30.130247 
## Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn 
##                 30.015370                 29.903890 
##       Max.VisibilityMiles      Mean.VisibilityMiles 
##                  9.895890                  8.249315 
##       Min.VisibilityMiles         Max.Wind.SpeedMPH 
##                  4.824658                 19.101370 
##        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH 
##                  8.679452                        NA 
##                CloudCover            WindDirDegrees 
##                  4.367123                205.000000

6.2.2 The apply() function

R also has functions rowMeans(), colSums(), and rowSums(). The function apply() function applies a user-chosen function to either the rows or columns (or both) of a data frame. The arguments are X, the data frame of interest, MARGIN, specifying either rows (MARGIN = 1) or columns (MARGIN = 2), and FUN, the function to be applied.

apply(X = WeatherKLAN2014Full[, c(2:19, 21, 23)], MARGIN = 2,FUN = sd)
##          Max.TemperatureF         Mean.TemperatureF 
##                22.2129716                20.9729330 
##          Min.TemperatureF            Max.Dew.PointF 
##                20.2596676                19.5167159 
##            MeanDew.PointF             Min.DewpointF 
##                20.0311014                20.8511271 
##              Max.Humidity             Mean.Humidity 
##                 8.1909784                 9.3660269 
##              Min.Humidity  Max.Sea.Level.PressureIn 
##                13.9461681                 0.2032259 
## Mean.Sea.Level.PressureIn  Min.Sea.Level.PressureIn 
##                 0.2159484                 0.2359542 
##       Max.VisibilityMiles      Mean.VisibilityMiles 
##                 0.5790382                 2.1059259 
##       Min.VisibilityMiles         Max.Wind.SpeedMPH 
##                 3.8168143                 6.4831246 
##        Mean.Wind.SpeedMPH         Max.Gust.SpeedMPH 
##                 3.8862592                        NA 
##                CloudCover            WindDirDegrees 
##                 2.7798359                90.0673130

6.2.3 Saving typing using with()

The with() function tells R that we are working with a particular data frame, and we don’t need to keep typing the name of the data frame.

with(WeatherKLAN2014Full, mean(Max.TemperatureF[CloudCover <4 & Max.Humidity > 85]))
## [1] 69.39241

6.2.3.1 More examples

m <- matrix(c(1:9), nrow = 3, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

To calcutalte sum of each columns we can:

sum(m[,1])
## [1] 6
sum(m[,2])
## [1] 15
sum(m[,3])
## [1] 24
## or
 colSums(m)
## [1]  6 15 24

A better way is to use apply() function Syntax apply(x, MARGIN, FUN) x data MARGIN how to apply the function 1 = rows 2 = columns FUN the function to apply

apply(X=m, MARGIN=2,FUN=sum)
## [1]  6 15 24

What if your function requires multiple arguments? USE:

example: Consider m and z where z = c(1,2,3) Compute the correclation between z and each column of m cor(x, y) apply(m, 2, cor, y=z)

if MARGIN = c(1, 2), apply the function on both rows and columns

l,s,t apply If you have a list and you want to apply a function to each element of the list, USE: lapply(X, FUN, ...) returns a list lapply(X, FUN, ...) returns a vector/matrix

If you want to process data by groups, use tapply tapply (X, INDEX, FUN, ...) INDEX ≠ group factor (levels/groups)

6.3 Transforming a data frame

Commonly variables are added to, removed from, changed in, or rearranged in, a data frame.

library(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

6.3.1 Adding variables

The data frame contains per capita GDP and population, and it might be interesting to create a variable that gives the total GDP by multiplying these two variables.

gapminder$TotalGDP <- gapminder$gdpPercap * gapminder$pop
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  7 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  $ TotalGDP : num  6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...

Analogous to the with() function, there is a function within() which can simplify the syntax. Whereas with() does not change the data frame, within() can. Note, below I first remove the altered gapminder dataframe using rm() then bring a clean copy back in by reloading the gapminder package.

rm(gapminder)
library(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
gapminder <- within(gapminder, TotalGDP <- gdpPercap * pop)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  7 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
##  $ TotalGDP : num  6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...

A nice feature of within() is its ability to add more than one variable at a time to a data frame. In this case the two or more formulas creating new variables must be enclosed in braces.

gapminder <- within(gapminder, {TotalGDP <- gdpPercap * pop 
lifeExpMonths <- lifeExp * 12})
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  8 variables:
##  $ country      : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent    : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year         : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp      : num  28.8 30.3 32 34 36.1 ...
##  $ pop          : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap    : num  779 821 853 836 740 ...
##  $ TotalGDP     : num  6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
##  $ lifeExpMonths: num  346 364 384 408 433 ...

6.3.2 Removing variables

After reflection we may realize the new variables we added to the gapminder data frame are not useful, and should be removed.

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  8 variables:
##  $ country      : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent    : Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year         : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp      : num  28.8 30.3 32 34 36.1 ...
##  $ pop          : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap    : num  779 821 853 836 740 ...
##  $ TotalGDP     : num  6.57e+09 7.59e+09 8.76e+09 9.65e+09 9.68e+09 ...
##  $ lifeExpMonths: num  346 364 384 408 433 ...
gapminder <- gapminder[1:6]
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

The same result could be obtained via gapminder <- gapminder[, 1:6].

a <- data.frame(x = 1:3, y = c("dog", "cat", "pig"), z = seq(from = 1,to = 2, length = 3))
a
##   x   y   z
## 1 1 dog 1.0
## 2 2 cat 1.5
## 3 3 pig 2.0
b <- a[1]
b
##   x
## 1 1
## 2 2
## 3 3
b <- a[, 1]
b
## [1] 1 2 3
c <- a[-(2:3)]
c
##   x
## 1 1
## 2 2
## 3 3
#d <- a[-2:3]

One can also use a negative sign in front of the variable number(s). For example, a[-(2:3)] would drop the first two columns of a.

6.3.3 Transforming variables

Imagine we want to modify the existing variable to measure life expectancy in months.

rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.801 30.332 31.997 34.020 36.088
gapminder$lifeExp <- gapminder$lifeExp * 12
gapminder$lifeExp[1:5]
## [1] 345.612 363.984 383.964 408.240 433.056
rm(gapminder)
library(gapminder)
gapminder$lifeExp[1:5]
## [1] 28.801 30.332 31.997 34.020 36.088
gapminder <- within(gapminder, lifeExp <- lifeExp * 12)
gapminder$lifeExp[1:5]
## [1] 345.612 363.984 383.964 408.240 433.056

6.3.4 Rearranging variables

Consider the full weather data set again.

u.weather <- "http://blue.for.msu.edu/FOR875/data/WeatherKLAN2014Full.csv"
WeatherKLAN2014Full <- read.csv(u.weather, header=TRUE,
                                stringsAsFactors = FALSE,
                                na.string = "")
names(WeatherKLAN2014Full)
##  [1] "EST"                       "Max.TemperatureF"         
##  [3] "Mean.TemperatureF"         "Min.TemperatureF"         
##  [5] "Max.Dew.PointF"            "MeanDew.PointF"           
##  [7] "Min.DewpointF"             "Max.Humidity"             
##  [9] "Mean.Humidity"             "Min.Humidity"             
## [11] "Max.Sea.Level.PressureIn"  "Mean.Sea.Level.PressureIn"
## [13] "Min.Sea.Level.PressureIn"  "Max.VisibilityMiles"      
## [15] "Mean.VisibilityMiles"      "Min.VisibilityMiles"      
## [17] "Max.Wind.SpeedMPH"         "Mean.Wind.SpeedMPH"       
## [19] "Max.Gust.SpeedMPH"         "PrecipitationIn"          
## [21] "CloudCover"                "Events"                   
## [23] "WindDirDegrees"

If we want the wind speed variables to come right after the date, we can again use subsetting.

WeatherKLAN2014Full <- WeatherKLAN2014Full[c(1, 17, 18, 19, 2:16,20:23)]
names(WeatherKLAN2014Full)
##  [1] "EST"                       "Max.Wind.SpeedMPH"        
##  [3] "Mean.Wind.SpeedMPH"        "Max.Gust.SpeedMPH"        
##  [5] "Max.TemperatureF"          "Mean.TemperatureF"        
##  [7] "Min.TemperatureF"          "Max.Dew.PointF"           
##  [9] "MeanDew.PointF"            "Min.DewpointF"            
## [11] "Max.Humidity"              "Mean.Humidity"            
## [13] "Min.Humidity"              "Max.Sea.Level.PressureIn" 
## [15] "Mean.Sea.Level.PressureIn" "Min.Sea.Level.PressureIn" 
## [17] "Max.VisibilityMiles"       "Mean.VisibilityMiles"     
## [19] "Min.VisibilityMiles"       "PrecipitationIn"          
## [21] "CloudCover"                "Events"                   
## [23] "WindDirDegrees"

6.4 Reshaping data

A data set can be represented in several different formats. (Wide or Long)

6.4.1 Tidyr

The R library tidyr has functions for converting data between formats. A tidy dataset has: 1. Each variable has its own column 2. Each observation has its own row 3. Each value has its own cell

Useful tidy functions: 1. gather() 2. unite() 3. separate() 4. spread ()

u.rel <- "http://blue.for.msu.edu/FOR875/data/religion2.csv"
religion <- read.csv(u.rel, header = TRUE, stringsAsFactors = FALSE)
head(religion)
##             religion under10k btw10and20k btw20and30k btw30and40k
## 1           Agnostic       27          34          60          81
## 2            Atheist       12          27          37          52
## 3           Buddhist       27          21          30          34
## 4           Catholic      418         617         732         670
## 5 DoNotKnowOrRefused       15          14          15          11
## 6    EvangelicalProt      575         869        1064         982
##   btw40and50k btw50and75k btw75and100k btw100and150k over150k
## 1          76         137          122           109       84
## 2          35          70           73            59       74
## 3          33          58           62            39       53
## 4         638        1116          949           792      633
## 5          10          35           21            17       18
## 6         881        1486          949           723      414
##   DoNotKnowOrRefused
## 1                 96
## 2                 76
## 3                 54
## 4               1489
## 5                116
## 6               1529

6.4.1.1 Examples of tidy tables

1. TABLE 1

library(tidyr)

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

1. TABLE 2

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

1. TABLE 3

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

1. TABLE 4

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

The gather() function can transform data from wide to long format. It is useful when variable is spread acrosss multiple columns

To use gather() we specified the data frame (data = religion), the name we want to give to the column created from the income levels (key = IncomeLevel), the name we want to give to the column containing the frequency values (value = Frequency) and the columns to gather (2:11).

library(tidyr)
religionLong <- gather(data = religion, key = IncomeLevel, value = Frequency, 2:11)
head(religionLong)
##             religion IncomeLevel Frequency
## 1           Agnostic    under10k        27
## 2            Atheist    under10k        12
## 3           Buddhist    under10k        27
## 4           Catholic    under10k       418
## 5 DoNotKnowOrRefused    under10k        15
## 6    EvangelicalProt    under10k       575
tail(religionLong)
##                religion        IncomeLevel Frequency
## 175              Muslim DoNotKnowOrRefused        22
## 176            Orthodox DoNotKnowOrRefused        73
## 177      OtherChristian DoNotKnowOrRefused        18
## 178         OtherFaiths DoNotKnowOrRefused        71
## 179 OtherWorldReligions DoNotKnowOrRefused         8
## 180        Unaffiliated DoNotKnowOrRefused       597

When observation is scattered across multiple rows. We can use spread(). This changes the data from Long -> wide format

religionWide <- spread(data = religionLong, key = IncomeLevel, value = Frequency)
head(religionWide)
##             religion btw100and150k btw10and20k btw20and30k btw30and40k
## 1           Agnostic           109          34          60          81
## 2            Atheist            59          27          37          52
## 3           Buddhist            39          21          30          34
## 4           Catholic           792         617         732         670
## 5 DoNotKnowOrRefused            17          14          15          11
## 6    EvangelicalProt           723         869        1064         982
##   btw40and50k btw50and75k btw75and100k DoNotKnowOrRefused over150k
## 1          76         137          122                 96       84
## 2          35          70           73                 76       74
## 3          33          58           62                 54       53
## 4         638        1116          949               1489      633
## 5          10          35           21                116       18
## 6         881        1486          949               1529      414
##   under10k
## 1       27
## 2       12
## 3       27
## 4      418
## 5       15
## 6      575

Here we specify the data frame (religionLong), the column (IncomeLevel) to be spread, and the column of values (Frequency) to be spread among the newly created columns.

tidyr provides two other useful functions to separate and unite variables based on some deliminator. Consider again the yearlyIncomeWide table. Say we want to split the name variable into first and last name. This can be done using the separate() function. It is useful when cells have multiple values

yearlyIncomeLong <- data.frame(name = c("John Son", "Jim Doe", "Sam Nil"), y = c("dog", "cat", "pig"), z = seq(from = 1,to = 2, length = 3))

firstLast <- separate(data = yearlyIncomeLong, col = name, into = c("first", "last"), sep="\\s")
print(firstLast)
##   first last   y   z
## 1  John  Son dog 1.0
## 2   Jim  Doe cat 1.5
## 3   Sam  Nil pig 2.0

If you want to combine the name column again, This is done using the unite() function. It is useful when a single value in multiple cells

unite(firstLast, col = name, first, last, sep = "_")
##       name   y   z
## 1 John_Son dog 1.0
## 2  Jim_Doe cat 1.5
## 3  Sam_Nil pig 2.0

Examples

separate(data = table3, col = "rate", into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
## * <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
gather(table4a, key = "year", value = "cases", -1)
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

6.5 Manipulating data with dplyr

mutate() adds new variables that are functions of existing variables, select() picks variables based on their names, filter() picks cases based on their values, summarize() reduces multiple values down to a single summary, arrange() changes the ordering of the rows.

These all combine naturally with a group_by() that allows you to perform any operation grouped by values of one or more variables.

6.5.1 Improved data frames

tbl_df() creates a tibble3.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(religionWide)
##             religion btw100and150k btw10and20k btw20and30k btw30and40k
## 1           Agnostic           109          34          60          81
## 2            Atheist            59          27          37          52
## 3           Buddhist            39          21          30          34
## 4           Catholic           792         617         732         670
## 5 DoNotKnowOrRefused            17          14          15          11
## 6    EvangelicalProt           723         869        1064         982
##   btw40and50k btw50and75k btw75and100k DoNotKnowOrRefused over150k
## 1          76         137          122                 96       84
## 2          35          70           73                 76       74
## 3          33          58           62                 54       53
## 4         638        1116          949               1489      633
## 5          10          35           21                116       18
## 6         881        1486          949               1529      414
##   under10k
## 1       27
## 2       12
## 3       27
## 4      418
## 5       15
## 6      575
religionWideTbl <- tbl_df(religionWide)
head(religionWideTbl)
## # A tibble: 6 x 11
##   religion btw100and150k btw10and20k btw20and30k btw30and40k btw40and50k
##   <chr>            <int>       <int>       <int>       <int>       <int>
## 1 Agnostic           109          34          60          81          76
## 2 Atheist             59          27          37          52          35
## 3 Buddhist            39          21          30          34          33
## 4 Catholic           792         617         732         670         638
## 5 DoNotKn…            17          14          15          11          10
## 6 Evangel…           723         869        1064         982         881
## # ... with 5 more variables: btw50and75k <int>, btw75and100k <int>,
## #   DoNotKnowOrRefused <int>, over150k <int>, under10k <int>

6.5.2 Filtering data by row

The read.delim() function defaults to header = TRUE

The filter() is used to select rows based on certain criteria of columns varibles

u.gm <- "http://blue.for.msu.edu/FOR875/data/gapminder.tsv"
gm <- read.delim(u.gm)
gm <- tbl_df(gm)

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  346 364 384 408 433 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
head(gm)
## # A tibble: 6 x 6
##   country      year      pop continent lifeExp gdpPercap
##   <fct>       <int>    <dbl> <fct>       <dbl>     <dbl>
## 1 Afghanistan  1952  8425333 Asia         28.8      779.
## 2 Afghanistan  1957  9240934 Asia         30.3      821.
## 3 Afghanistan  1962 10267083 Asia         32.0      853.
## 4 Afghanistan  1967 11537966 Asia         34.0      836.
## 5 Afghanistan  1972 13079460 Asia         36.1      740.
## 6 Afghanistan  1977 14880372 Asia         38.4      786.

Filtering helps us to examine subsets of the data such as data from a particular country or from several specified countries, data from certain years, from countries with certain populations, etc. Some examples:

filter(gm, country == "Brazil")
## # A tibble: 12 x 6
##    country  year       pop continent lifeExp gdpPercap
##    <fct>   <int>     <dbl> <fct>       <dbl>     <dbl>
##  1 Brazil   1952  56602560 Americas     50.9     2109.
##  2 Brazil   1957  65551171 Americas     53.3     2487.
##  3 Brazil   1962  76039390 Americas     55.7     3337.
##  4 Brazil   1967  88049823 Americas     57.6     3430.
##  5 Brazil   1972 100840058 Americas     59.5     4986.
##  6 Brazil   1977 114313951 Americas     61.5     6660.
##  7 Brazil   1982 128962939 Americas     63.3     7031.
##  8 Brazil   1987 142938076 Americas     65.2     7807.
##  9 Brazil   1992 155975974 Americas     67.1     6950.
## 10 Brazil   1997 168546719 Americas     69.4     7958.
## 11 Brazil   2002 179914212 Americas     71.0     8131.
## 12 Brazil   2007 190010647 Americas     72.4     9066.
filter(gm, country %in% c("Brazil", "Mexico"))
## # A tibble: 24 x 6
##    country  year       pop continent lifeExp gdpPercap
##    <fct>   <int>     <dbl> <fct>       <dbl>     <dbl>
##  1 Brazil   1952  56602560 Americas     50.9     2109.
##  2 Brazil   1957  65551171 Americas     53.3     2487.
##  3 Brazil   1962  76039390 Americas     55.7     3337.
##  4 Brazil   1967  88049823 Americas     57.6     3430.
##  5 Brazil   1972 100840058 Americas     59.5     4986.
##  6 Brazil   1977 114313951 Americas     61.5     6660.
##  7 Brazil   1982 128962939 Americas     63.3     7031.
##  8 Brazil   1987 142938076 Americas     65.2     7807.
##  9 Brazil   1992 155975974 Americas     67.1     6950.
## 10 Brazil   1997 168546719 Americas     69.4     7958.
## # ... with 14 more rows
filter(gm, country %in% c("Brazil", "Mexico") & year %in% c(1952, 1972))
## # A tibble: 4 x 6
##   country  year       pop continent lifeExp gdpPercap
##   <fct>   <int>     <dbl> <fct>       <dbl>     <dbl>
## 1 Brazil   1952  56602560 Americas     50.9     2109.
## 2 Brazil   1972 100840058 Americas     59.5     4986.
## 3 Mexico   1952  30144317 Americas     50.8     3478.
## 4 Mexico   1972  55984294 Americas     62.4     6809.
filter(gm, pop > 300000000)
## # A tibble: 25 x 6
##    country  year         pop continent lifeExp gdpPercap
##    <fct>   <int>       <dbl> <fct>       <dbl>     <dbl>
##  1 China    1952  556263528. Asia         44        400.
##  2 China    1957  637408000  Asia         50.5      576.
##  3 China    1962  665770000  Asia         44.5      488.
##  4 China    1967  754550000  Asia         58.4      613.
##  5 China    1972  862030000  Asia         63.1      677.
##  6 China    1977  943455000  Asia         64.0      741.
##  7 China    1982 1000281000  Asia         65.5      962.
##  8 China    1987 1084035000  Asia         67.3     1379.
##  9 China    1992 1164970000  Asia         68.7     1656.
## 10 China    1997 1230075000  Asia         70.4     2289.
## # ... with 15 more rows
filter(gm, pop > 300000000 & year == 2007)
## # A tibble: 3 x 6
##   country        year        pop continent lifeExp gdpPercap
##   <fct>         <int>      <dbl> <fct>       <dbl>     <dbl>
## 1 China          2007 1318683096 Asia         73.0     4959.
## 2 India          2007 1110396331 Asia         64.7     2452.
## 3 United States  2007  301139947 Americas     78.2    42952.

6.5.3 Selecting variables by column

The select() function will choose varibles of dataframe.

Variables can be selected by name or column number. As usual, a negative sign tells R to leave something out. And there are special functions such as starts_with that provide ways to match part of a variable’s name.

select(gm, country, year, lifeExp)
## # A tibble: 1,704 x 3
##    country      year lifeExp
##    <fct>       <int>   <dbl>
##  1 Afghanistan  1952    28.8
##  2 Afghanistan  1957    30.3
##  3 Afghanistan  1962    32.0
##  4 Afghanistan  1967    34.0
##  5 Afghanistan  1972    36.1
##  6 Afghanistan  1977    38.4
##  7 Afghanistan  1982    39.9
##  8 Afghanistan  1987    40.8
##  9 Afghanistan  1992    41.7
## 10 Afghanistan  1997    41.8
## # ... with 1,694 more rows
select(gm, 2:4)
## # A tibble: 1,704 x 3
##     year      pop continent
##    <int>    <dbl> <fct>    
##  1  1952  8425333 Asia     
##  2  1957  9240934 Asia     
##  3  1962 10267083 Asia     
##  4  1967 11537966 Asia     
##  5  1972 13079460 Asia     
##  6  1977 14880372 Asia     
##  7  1982 12881816 Asia     
##  8  1987 13867957 Asia     
##  9  1992 16317921 Asia     
## 10  1997 22227415 Asia     
## # ... with 1,694 more rows
select(gm, -c(2, 3, 4))
## # A tibble: 1,704 x 3
##    country     lifeExp gdpPercap
##    <fct>         <dbl>     <dbl>
##  1 Afghanistan    28.8      779.
##  2 Afghanistan    30.3      821.
##  3 Afghanistan    32.0      853.
##  4 Afghanistan    34.0      836.
##  5 Afghanistan    36.1      740.
##  6 Afghanistan    38.4      786.
##  7 Afghanistan    39.9      978.
##  8 Afghanistan    40.8      852.
##  9 Afghanistan    41.7      649.
## 10 Afghanistan    41.8      635.
## # ... with 1,694 more rows
select(gm, starts_with("c"))
## # A tibble: 1,704 x 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Afghanistan Asia     
##  3 Afghanistan Asia     
##  4 Afghanistan Asia     
##  5 Afghanistan Asia     
##  6 Afghanistan Asia     
##  7 Afghanistan Asia     
##  8 Afghanistan Asia     
##  9 Afghanistan Asia     
## 10 Afghanistan Asia     
## # ... with 1,694 more rows

6.5.4 Pipes

Consider selecting the country, year, and population for countries in Asia or Europe. One possibility is to nest a filter() function inside a select() function.

select(filter(gm, continent %in% c("Asia", "Europe")), country,year, pop)
## # A tibble: 756 x 3
##    country      year      pop
##    <fct>       <int>    <dbl>
##  1 Afghanistan  1952  8425333
##  2 Afghanistan  1957  9240934
##  3 Afghanistan  1962 10267083
##  4 Afghanistan  1967 11537966
##  5 Afghanistan  1972 13079460
##  6 Afghanistan  1977 14880372
##  7 Afghanistan  1982 12881816
##  8 Afghanistan  1987 13867957
##  9 Afghanistan  1992 16317921
## 10 Afghanistan  1997 22227415
## # ... with 746 more rows

There is a nice feature in dplyr that allows us to feed results of one function into the first argument of a subsequent function. The %>% operator does the piping.

gm %>% filter(continent %in% c("Asia", "Europe")) %>% select(country, year, pop)
## # A tibble: 756 x 3
##    country      year      pop
##    <fct>       <int>    <dbl>
##  1 Afghanistan  1952  8425333
##  2 Afghanistan  1957  9240934
##  3 Afghanistan  1962 10267083
##  4 Afghanistan  1967 11537966
##  5 Afghanistan  1972 13079460
##  6 Afghanistan  1977 14880372
##  7 Afghanistan  1982 12881816
##  8 Afghanistan  1987 13867957
##  9 Afghanistan  1992 16317921
## 10 Afghanistan  1997 22227415
## # ... with 746 more rows

6.5.5 Arranging data by row

By default the gapminder data are arranged by country and then by year The arrange() reoders the rows

head(gm, 10)
## # A tibble: 10 x 6
##    country      year      pop continent lifeExp gdpPercap
##    <fct>       <int>    <dbl> <fct>       <dbl>     <dbl>
##  1 Afghanistan  1952  8425333 Asia         28.8      779.
##  2 Afghanistan  1957  9240934 Asia         30.3      821.
##  3 Afghanistan  1962 10267083 Asia         32.0      853.
##  4 Afghanistan  1967 11537966 Asia         34.0      836.
##  5 Afghanistan  1972 13079460 Asia         36.1      740.
##  6 Afghanistan  1977 14880372 Asia         38.4      786.
##  7 Afghanistan  1982 12881816 Asia         39.9      978.
##  8 Afghanistan  1987 13867957 Asia         40.8      852.
##  9 Afghanistan  1992 16317921 Asia         41.7      649.
## 10 Afghanistan  1997 22227415 Asia         41.8      635.

Possibly arranging the data by year and then country would be desired. The arrange() function makes this easy. We will again use pipes.

gm %>% arrange(year, country)
## # A tibble: 1,704 x 6
##    country      year      pop continent lifeExp gdpPercap
##    <fct>       <int>    <dbl> <fct>       <dbl>     <dbl>
##  1 Afghanistan  1952  8425333 Asia         28.8      779.
##  2 Albania      1952  1282697 Europe       55.2     1601.
##  3 Algeria      1952  9279525 Africa       43.1     2449.
##  4 Angola       1952  4232095 Africa       30.0     3521.
##  5 Argentina    1952 17876956 Americas     62.5     5911.
##  6 Australia    1952  8691212 Oceania      69.1    10040.
##  7 Austria      1952  6927772 Europe       66.8     6137.
##  8 Bahrain      1952   120447 Asia         50.9     9867.
##  9 Bangladesh   1952 46886859 Asia         37.5      684.
## 10 Belgium      1952  8730405 Europe       68       8343.
## # ... with 1,694 more rows

How about the data for Rwanda, arranged in order of life expectancy.

gm %>% filter(country == "Rwanda") %>% arrange(lifeExp)
## # A tibble: 12 x 6
##    country  year     pop continent lifeExp gdpPercap
##    <fct>   <int>   <dbl> <fct>       <dbl>     <dbl>
##  1 Rwanda   1992 7290203 Africa       23.6      737.
##  2 Rwanda   1997 7212583 Africa       36.1      590.
##  3 Rwanda   1952 2534927 Africa       40        493.
##  4 Rwanda   1957 2822082 Africa       41.5      540.
##  5 Rwanda   1962 3051242 Africa       43        597.
##  6 Rwanda   2002 7852401 Africa       43.4      786.
##  7 Rwanda   1987 6349365 Africa       44.0      848.
##  8 Rwanda   1967 3451079 Africa       44.1      511.
##  9 Rwanda   1972 3992121 Africa       44.6      591.
## 10 Rwanda   1977 4657072 Africa       45        670.
## 11 Rwanda   1982 5507565 Africa       46.2      882.
## 12 Rwanda   2007 8860588 Africa       46.2      863.

Possibly we want these data to be in decreasing (descending) order. Here, desc() is one of many dplyr helper functions.

gm %>% filter(country == "Rwanda") %>% arrange(desc(lifeExp))
## # A tibble: 12 x 6
##    country  year     pop continent lifeExp gdpPercap
##    <fct>   <int>   <dbl> <fct>       <dbl>     <dbl>
##  1 Rwanda   2007 8860588 Africa       46.2      863.
##  2 Rwanda   1982 5507565 Africa       46.2      882.
##  3 Rwanda   1977 4657072 Africa       45        670.
##  4 Rwanda   1972 3992121 Africa       44.6      591.
##  5 Rwanda   1967 3451079 Africa       44.1      511.
##  6 Rwanda   1987 6349365 Africa       44.0      848.
##  7 Rwanda   2002 7852401 Africa       43.4      786.
##  8 Rwanda   1962 3051242 Africa       43        597.
##  9 Rwanda   1957 2822082 Africa       41.5      540.
## 10 Rwanda   1952 2534927 Africa       40        493.
## 11 Rwanda   1997 7212583 Africa       36.1      590.
## 12 Rwanda   1992 7290203 Africa       23.6      737.

Possibly we want to include only the year and life expectancy, to make the message more stark.

gm %>% filter(country == "Rwanda") %>% select(year, lifeExp) %>% arrange(desc(lifeExp))
## # A tibble: 12 x 2
##     year lifeExp
##    <int>   <dbl>
##  1  2007    46.2
##  2  1982    46.2
##  3  1977    45  
##  4  1972    44.6
##  5  1967    44.1
##  6  1987    44.0
##  7  2002    43.4
##  8  1962    43  
##  9  1957    41.5
## 10  1952    40  
## 11  1997    36.1
## 12  1992    23.6

6.5.6 Renaming variables

The dplyr package has a rename function that makes renaming variables in a data frame quite easy.

gm <- rename(gm, population = pop)
head(gm)
## # A tibble: 6 x 6
##   country      year population continent lifeExp gdpPercap
##   <fct>       <int>      <dbl> <fct>       <dbl>     <dbl>
## 1 Afghanistan  1952    8425333 Asia         28.8      779.
## 2 Afghanistan  1957    9240934 Asia         30.3      821.
## 3 Afghanistan  1962   10267083 Asia         32.0      853.
## 4 Afghanistan  1967   11537966 Asia         34.0      836.
## 5 Afghanistan  1972   13079460 Asia         36.1      740.
## 6 Afghanistan  1977   14880372 Asia         38.4      786.

6.5.7 Data summaries and grouping

The summarize() function computes summary statistics or user provided function for one or more columns of data in a data frame.

summarize(gm, meanpop = mean(population), medpop = median(population))
## # A tibble: 1 x 2
##     meanpop   medpop
##       <dbl>    <dbl>
## 1 29601212. 7023596.

For example, we might want the median life expectancy for each continent separately. One option is subsetting:

median(gm$lifeExp[gm$continent == "Africa"])
## [1] 47.792
median(gm$lifeExp[gm$continent == "Asia"])
## [1] 61.7915
median(gm$lifeExp[gm$continent == "Europe"])
## [1] 72.241
median(gm$lifeExp[gm$continent == "Americas"])
## [1] 67.048
median(gm$lifeExp[gm$continent == "Oceania"])
## [1] 73.665

The group_by() function makes this easier, and makes the output more useful.

gm %>% group_by(continent) %>% summarize(medLifeExp = median(lifeExp))
## # A tibble: 5 x 2
##   continent medLifeExp
##   <fct>          <dbl>
## 1 Africa          47.8
## 2 Americas        67.0
## 3 Asia            61.8
## 4 Europe          72.2
## 5 Oceania         73.7

Or if we want the results ordered by the median life expectancy:

gm %>% group_by(continent) %>% summarize(medLifeExp = median(lifeExp)) %>% arrange(medLifeExp)
## # A tibble: 5 x 2
##   continent medLifeExp
##   <fct>          <dbl>
## 1 Africa          47.8
## 2 Asia            61.8
## 3 Americas        67.0
## 4 Europe          72.2
## 5 Oceania         73.7

We can calculate the number of observations we have per continent (using the n() helper function),

gm %>% group_by(continent) %>% summarize(numObs = n())
## # A tibble: 5 x 2
##   continent numObs
##   <fct>      <int>
## 1 Africa       624
## 2 Americas     300
## 3 Asia         396
## 4 Europe       360
## 5 Oceania       24
gm %>% group_by(continent) %>% summarize(n_obs = n(), n_countries = n_distinct(country))
## # A tibble: 5 x 3
##   continent n_obs n_countries
##   <fct>     <int>       <int>
## 1 Africa      624          52
## 2 Americas    300          25
## 3 Asia        396          33
## 4 Europe      360          30
## 5 Oceania      24           2

Here is a bit more involved example that calculates the minimum and maximum life expectancies for countries in Africa by year.

gm %>% filter(continent == "Africa") %>% group_by(year) %>% summarize(min_lifeExp = min(lifeExp), max_lifeExp = max(lifeExp))
## # A tibble: 12 x 3
##     year min_lifeExp max_lifeExp
##    <int>       <dbl>       <dbl>
##  1  1952        30          52.7
##  2  1957        31.6        58.1
##  3  1962        32.8        60.2
##  4  1967        34.1        61.6
##  5  1972        35.4        64.3
##  6  1977        36.8        67.1
##  7  1982        38.4        69.9
##  8  1987        39.9        71.9
##  9  1992        23.6        73.6
## 10  1997        36.1        74.8
## 11  2002        39.2        75.7
## 12  2007        39.6        76.4

This is interesting, but the results don’t include the countries that achieved the minimum and maximum life expectancies.

gm %>% select(country, continent, year, lifeExp) %>% group_by(year) %>% arrange(year) %>% filter(rank(lifeExp) == 1)
## # A tibble: 12 x 4
## # Groups:   year [12]
##    country      continent  year lifeExp
##    <fct>        <fct>     <int>   <dbl>
##  1 Afghanistan  Asia       1952    28.8
##  2 Afghanistan  Asia       1957    30.3
##  3 Afghanistan  Asia       1962    32.0
##  4 Afghanistan  Asia       1967    34.0
##  5 Sierra Leone Africa     1972    35.4
##  6 Cambodia     Asia       1977    31.2
##  7 Sierra Leone Africa     1982    38.4
##  8 Angola       Africa     1987    39.9
##  9 Rwanda       Africa     1992    23.6
## 10 Rwanda       Africa     1997    36.1
## 11 Zambia       Africa     2002    39.2
## 12 Swaziland    Africa     2007    39.6

Next we add the maximum life expectancy

gm %>% select(country, continent, year, lifeExp) %>% group_by(year) %>% arrange(year) %>% filter(rank(lifeExp) == 1 | rank(desc(lifeExp)) == 1) %>% print(n=24)
## # A tibble: 24 x 4
## # Groups:   year [12]
##    country      continent  year lifeExp
##    <fct>        <fct>     <int>   <dbl>
##  1 Afghanistan  Asia       1952    28.8
##  2 Norway       Europe     1952    72.7
##  3 Afghanistan  Asia       1957    30.3
##  4 Iceland      Europe     1957    73.5
##  5 Afghanistan  Asia       1962    32.0
##  6 Iceland      Europe     1962    73.7
##  7 Afghanistan  Asia       1967    34.0
##  8 Sweden       Europe     1967    74.2
##  9 Sierra Leone Africa     1972    35.4
## 10 Sweden       Europe     1972    74.7
## 11 Cambodia     Asia       1977    31.2
## 12 Iceland      Europe     1977    76.1
## 13 Japan        Asia       1982    77.1
## 14 Sierra Leone Africa     1982    38.4
## 15 Angola       Africa     1987    39.9
## 16 Japan        Asia       1987    78.7
## 17 Japan        Asia       1992    79.4
## 18 Rwanda       Africa     1992    23.6
## 19 Japan        Asia       1997    80.7
## 20 Rwanda       Africa     1997    36.1
## 21 Japan        Asia       2002    82  
## 22 Zambia       Africa     2002    39.2
## 23 Japan        Asia       2007    82.6
## 24 Swaziland    Africa     2007    39.6

6.5.8 Creating new variables

The $ notation provides a simple way to create new variables in a data frame. The mutate() function provides another, sometimes cleaner way to do this. It is used to add a new variable to the dataframe

gm %>% group_by(country) %>% mutate(changeLifeExp = lifeExp -
+ lag(lifeExp, order_by = year)) %>% select(-c(population,
+ gdpPercap))
## # A tibble: 1,704 x 5
## # Groups:   country [142]
##    country      year continent lifeExp changeLifeExp
##    <fct>       <int> <fct>       <dbl>         <dbl>
##  1 Afghanistan  1952 Asia         28.8       NA     
##  2 Afghanistan  1957 Asia         30.3        1.53  
##  3 Afghanistan  1962 Asia         32.0        1.66  
##  4 Afghanistan  1967 Asia         34.0        2.02  
##  5 Afghanistan  1972 Asia         36.1        2.07  
##  6 Afghanistan  1977 Asia         38.4        2.35  
##  7 Afghanistan  1982 Asia         39.9        1.42  
##  8 Afghanistan  1987 Asia         40.8        0.968 
##  9 Afghanistan  1992 Asia         41.7        0.852 
## 10 Afghanistan  1997 Asia         41.8        0.0890
## # ... with 1,694 more rows

6.5.5 Class Examples:

data("mtcars")
mtcars <- as_tibble(mtcars)
mtcars
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows
filter(mtcars, cyl==6 & am ==1)
## # A tibble: 3 x 11
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
## 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
## 3  19.7     6   145   175  3.62  2.77  15.5     0     1     5     6
arrange(mtcars, hp)
## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
##  2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
##  3  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
##  4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
##  5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
##  6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
##  7  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
##  8  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
##  9  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
## 10  18.1     6 225     105  2.76  3.46  20.2     1     0     3     1
## # ... with 22 more rows
ratio_mtcars <- mutate(mtcars, ratio = hp/mpg)
ratio_mtcars
## # A tibble: 32 x 12
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb ratio
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4  5.24
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4  5.24
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1  4.08
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1  5.14
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2  9.36
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1  5.80
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4 17.1 
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2  2.54
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2  4.17
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4  6.41
## # ... with 22 more rows
group_by(mtcars, cyl) %>% summarise(mean.hp =mean(hp))
## # A tibble: 3 x 2
##     cyl mean.hp
##   <dbl>   <dbl>
## 1     4    82.6
## 2     6   122. 
## 3     8   209.

8.0 Simulation

This chapter will provide a brief introduction to using R for simulationt

The purpose of simulations: 1. Test statistical intuition 2. Validate theory 3. Experiemnt with a new technique

Generate random numbers from different distribution with r i.e rnorm rbinom rchi

To generate from a finite populattion use sample() sample( X, size, replace=F, prob=NULL) X = vector of data to sample

Monty Hall Ploblem Is it better to stay or switch doors? Or does it not matter?

8.1 Pseudo-random numbers

Simulations use numbers generated by a pseudo-random number generator (PRNG) rather than by a truly random number generator.

Linear congruential generators are a common starting point. A linear congruential generator requires a multiplier (a), an increment (c), a modulus (m), and a seed X0, all integers. The generator is defined by

\(X_i = (aX_{i-1} + c)\) mod m:

Recall that Z mod m gives the remainder when Z is divided by m. In R Z mod m is implemented as Z %% m.

15%%2
## [1] 1
15%%5
## [1] 0
15%%6
## [1] 3

The function lcg() below implements this idea in R.

lcg <- function(a, c, seed, m, iter) {
    out <- numeric(iter)
    out[1] <- seed
    for (i in 2:(iter)) {
        out[i] <- (a * out[i - 1] + c)%%m
    }
    return(out)
}
lcg(a = 65, c = 1, seed = 0, m = 2048, iter = 10)/2048
##  [1] 0.0000000000 0.0004882812 0.0322265625 0.0952148438 0.1894531250
##  [6] 0.3149414062 0.4716796875 0.6596679688 0.8789062500 0.1293945312
lcg(a = 1365, c = 1, seed = 0, m = 2048, iter = 10)/2048
##  [1] 0.0000000000 0.0004882812 0.6669921875 0.4448242188 0.1855468750
##  [6] 0.2719726562 0.2431640625 0.9194335938 0.0273437500 0.3247070312
lcg(a = 1229, c = 1, seed = 0, m = 2048, iter = 10)/2048
##  [1] 0.0000000000 0.0004882812 0.6005859375 0.1206054688 0.2246093750
##  [6] 0.0454101562 0.8095703125 0.9624023438 0.7929687500 0.5590820312

A very basic way to assess the performance of a PRNG is to draw a histogram of the values. The histogram should look like the histogram of uniformly distributed values.

tmp_random <- data.frame(x = lcg(a = 1229, c = 1, seed = 0, m = 2048, iter = 2048)/2048)
ggplot(tmp_random, aes(x = x)) + geom_histogram(binwidth = 0.1)

Being able to provide a seed is important for repeatability of simulations. The set.seed() function in R provides this capability

set.seed(123)
runif(5)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
set.seed(123)
runif(5)
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673
runif(5)
## [1] 0.0455565 0.5281055 0.8924190 0.5514350 0.4566147

8.2 Simulating from standard distributions

For example suppose that we want to generate coin flips, i.e., to generate from a Bernoulli (0.5) distribution. Here is one way to do this.

generate_coin <- function(n) {
    out <- rep(0, n)
    out[runif(n) > 0.5] <- 1
    return(out)
}
generate_coin(10)
##  [1] 1 0 1 1 0 1 0 0 0 1

If we wanted H and T for heads and tails, a similar method would work.

generate_coin <- function(n) {
    out <- rep("T", n)
    out[runif(n) > 0.5] <- "H"
    return(out)
}
generate_coin(10)
##  [1] "H" "H" "H" "H" "H" "H" "H" "H" "T" "T"

R provides functions that give the density (or mass function), cumulative distribution function, quantile function, and generate values.

  1. dnorm is the normal density function
  2. pnorm is the normal cumulative distribution function
  3. qnorm is the normal quantile function
  4. rnorm generates values from a normal distribution

Rather than designing our own methods for simulating from these distributions, it is better to use the built in functions.

# 10 values from the N(100, 2) distribution
rnorm(10, mean = 100, sd = 2)
##  [1] 103.57383 100.99570  96.06677 101.40271  99.05442  97.86435  99.56405
##  [8]  97.94799  98.54222  98.74992
# 5 values from the Uniform(-3, 4) distribution
runif(5, min = -3, max = 4)
## [1] -2.67918183  0.09540052  2.59247392 -2.14670518  0.92663589
# 7 values from the binomial(10, 0.2) distribution
rbinom(7, size = 10, p = 0.2)
## [1] 1 1 3 4 1 2 0

In addition, the sample function provides a way to sample from a finite population.

# 10 coin flips
sample(c("H", "T"), 10, replace = TRUE)
##  [1] "H" "H" "T" "H" "T" "T" "T" "H" "T" "T"
# a random permutation
sample(1:9, 9, replace = FALSE)
## [1] 7 1 4 2 6 3 5 8 9
# sample from a distribution with P(X=2) = 0.3; P(X=3)=0.5;  and P(X=17)=0.2
sample(c(2, 3, 17), size = 15, replace = TRUE, prob = c(0.3, 0.5, 0.2))
##  [1]  2  3  2  3  3 17 17 17  3  3  2  3  2  3  3

8.3 Investigating t tests

Inverse CDF method (Inverse Transform Sampling)

Let X be our random variable of interest and assume X has CDF, \(F_x(x)\)

GOAL: Generate random varibles from X

ALGORITHM 1. Generate \(U=Unif(0, 1)\) 2. Find x such that \(F_x(x) = U\) => \(x = F_x^-1 (U)\)

Theorem: Let X have continous CDF and define the variable Y such that \(Y=Unif(0,1)\) \(Y = F_x(x)\) Then \(Y = Unif(0,1)\)

Proof: \(P(Y=< y)\) = \(P(F_x(x) =<y)\)

Solve: \(x_1 = F_x^-1 (U_i)\) U is uniform \(1-e^(z/x_1) = U_\) \(-e^(-z/x_i) = u_i -1\) \(x_i = -zln(i-u_i)\)

You can control random number generation with set.seed(). R uses Mersenne Twitter algorithm to generate random numbers based off the seed

An Example: Algorithm 1. Select a seed 2. Multiply the seed by itself 3. Output the middle number 4. Use the result as the new seed

seed = 14
seed * seed # 196 -> 19
## [1] 196
19 * 19 # 361 -> 36
## [1] 361
36 * 36 # 1296 - 29
## [1] 1296

9.0 Classifcation

Classifcation methods are both extremely useful and an active area of research in statistics. In this chapter we will learn about two common, and somewhat different, classification methods, logistic regression and k nearest neighbors

9.1 Logistic regression

We would like to predict the value of Y based on the value of the predictor X. Let p(X) = P(Y = 1jX). We will model p(X) with the logistic function, which takes values between 0 and 1, which is of course appropriate for modeling a probability:

library(ggplot2)
logistic <- function(x) {
    exp(x)/(1 + exp(x))
}
ggplot(data.frame(x = c(-6, 6)), aes(x)) + stat_function(fun = logistic)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Pima.tr)
##   npreg glu bp skin  bmi   ped age type
## 1     5  86 68   28 30.2 0.364  24   No
## 2     7 195 70   33 25.1 0.163  55  Yes
## 3     5  77 82   41 35.8 0.156  35   No
## 4     0 165 76   43 47.9 0.259  26   No
## 5     0 107 60   25 26.4 0.133  23   No
## 6     5  97 76   27 35.6 0.378  52  Yes

It will be more convenient to code the presence or absence of diabetes by 1 and 0, so we first create another column in the data frame with this coding:

Pima.tr$diabetes <- rep(0, dim(Pima.tr)[1])
Pima.tr$diabetes[Pima.tr$type == "Yes"] <- 1
head(Pima.tr)
##   npreg glu bp skin  bmi   ped age type diabetes
## 1     5  86 68   28 30.2 0.364  24   No        0
## 2     7 195 70   33 25.1 0.163  55  Yes        1
## 3     5  77 82   41 35.8 0.156  35   No        0
## 4     0 165 76   43 47.9 0.259  26   No        0
## 5     0 107 60   25 26.4 0.133  23   No        0
## 6     5  97 76   27 35.6 0.378  52  Yes        1

In R we will use the function glm to fit logistic regression models The glm function fits a wide variety of models.

To specify the logistic regression model we specify family = binomial as an argument to glm.

  1. predictor and response variables (diabetes ~ glu)
diabetes.lr1 <- glm(diabetes ~ glu, data = Pima.tr, family = binomial)
diabetes.lr1
## 
## Call:  glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)
## 
## Coefficients:
## (Intercept)          glu  
##    -5.50364      0.03778  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       256.4 
## Residual Deviance: 207.4     AIC: 211.4
summary(diabetes.lr1)
## 
## Call:
## glm(formula = diabetes ~ glu, family = binomial, data = Pima.tr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9714  -0.7795  -0.5292   0.8491   2.2633  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.503636   0.836077  -6.583 4.62e-11 ***
## glu          0.037784   0.006278   6.019 1.76e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 256.41  on 199  degrees of freedom
## Residual deviance: 207.37  on 198  degrees of freedom
## AIC: 211.37
## 
## Number of Fisher Scoring iterations: 4
beta0.lr.1 <- coef(diabetes.lr1)[1]
beta1.lr.1 <- coef(diabetes.lr1)[2]
beta0.lr.1
## (Intercept) 
##   -5.503636
beta1.lr.1
##        glu 
## 0.03778372

The coefficients b0 and b1 are approximately b0 = -5.504 and b1 = 0:038. So for example we can estimate the probability that a woman in this population whose glucose level is 150 would be diabetic as:

exp(-5.504 + 0.038 * 150)/(1 + exp(-5.504 + 0.038 * 150))
## [1] 0.5488437

We can plot the fitted probabilities along with the data by hand.

diabetes.logistic.1 <- function(x){
    exp(beta0.lr.1 + beta1.lr.1 * x)/(1 + exp(beta0.lr.1 + beta1.lr.1 * x))
}
ggplot(Pima.tr, aes(x = glu, y = diabetes)) + 
    stat_function(fun = diabetes.logistic.1)+
    geom_point()

The ggplot2 package also provides a way to do this more directly, using stat_smooth.

ggplot(Pima.tr, aes(x = glu, y = diabetes)) +
    geom_point() +
    stat_smooth(method = "glm", method.args = list(family = "binomial"), se = F)

From these graphics we can see that although glucose level and diabetes are related, there are many women with high glucose levels who are not diabetic, and many with low glucose levels who are diabetic, so likely adding other predictors to the model will help.

head(Pima.te)
##   npreg glu bp skin  bmi   ped age type
## 1     6 148 72   35 33.6 0.627  50  Yes
## 2     1  85 66   29 26.6 0.351  31   No
## 3     1  89 66   23 28.1 0.167  21   No
## 4     3  78 50   32 31.0 0.248  26  Yes
## 5     2 197 70   45 30.5 0.158  53  Yes
## 6     5 166 72   19 25.8 0.587  51  Yes
diabetes.probs.1 <- predict(diabetes.lr1, Pima.te, type = "response")
head(diabetes.probs.1)
##          1          2          3          4          5          6 
## 0.52207428 0.09178605 0.10518608 0.07199064 0.87432542 0.68318799

The predict function (with type = "response" specifed) provides p(x) = P(Y = 1jX = x) for all the x values in a data frame.

diabetes.predict.1 <- rep("No", dim(Pima.te)[1])
diabetes.predict.1[diabetes.probs.1 > 0.5] <- "Yes"
head(diabetes.predict.1)
## [1] "Yes" "No"  "No"  "No"  "Yes" "Yes"
length(diabetes.predict.1[diabetes.predict.1 == Pima.te$type])/dim(Pima.te)[1]
## [1] 0.7740964

9.1.1 Adding predictors

We will next see how adding bmi, the body mass index, affects predictions of diabetes.

diabetes.lr2 <- glm(diabetes ~ glu + bmi, data = Pima.tr, family = binomial)
diabetes.lr2
## 
## Call:  glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
## 
## Coefficients:
## (Intercept)          glu          bmi  
##    -8.21611      0.03572      0.09002  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       256.4 
## Residual Deviance: 198.5     AIC: 204.5
summary(diabetes.lr2)
## 
## Call:
## glm(formula = diabetes ~ glu + bmi, family = binomial, data = Pima.tr)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0577  -0.7566  -0.4313   0.8011   2.2489  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.216106   1.346965  -6.100 1.06e-09 ***
## glu          0.035716   0.006311   5.659 1.52e-08 ***
## bmi          0.090016   0.031268   2.879  0.00399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 256.41  on 199  degrees of freedom
## Residual deviance: 198.47  on 197  degrees of freedom
## AIC: 204.47
## 
## Number of Fisher Scoring iterations: 4
diabetes.probs.2 <- predict(diabetes.lr2, Pima.te, type = "response")
head(diabetes.probs.2)
##          1          2          3          4          5          6 
## 0.52358599 0.05809584 0.07530477 0.06662362 0.82713369 0.50879270
diabetes.predict.2 <- rep("No", dim(Pima.te)[1])
diabetes.predict.2[diabetes.probs.2 > 0.5] <- "Yes"
head(diabetes.predict.2)
## [1] "Yes" "No"  "No"  "No"  "Yes" "Yes"
table(diabetes.predict.2, Pima.te$type)
##                   
## diabetes.predict.2  No Yes
##                No  204  54
##                Yes  19  55
length(diabetes.predict.2[diabetes.predict.2 == Pima.te$type])/dim(Pima.te)[1]
## [1] 0.7801205

Adding bmi as a predictor did not improve the predictions by very much!

lr.int <- -coef(diabetes.lr2)[1]/coef(diabetes.lr2)[3]
lr.slope <- -coef(diabetes.lr2)[2]/coef(diabetes.lr2)[3]
ggplot(Pima.tr, aes(x = glu, y = bmi)) + geom_point(aes(color = type)) +
    geom_abline(intercept = lr.int, slope = lr.slope)

9.1.2 More than two classes

Logistic regression methods also are applicable to classification contexts where there are more than two classes. Consider for example Fisher’s iris data.

data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
    geom_point(aes(color = Species))

ggplot(data = iris, aes(x = Petal.Length, y = Petal.Width)) +
    geom_point(aes(color = Species))

Here the potential predictors are sepal width and length and petal width and length, and the goal is to find a classifier that will yield the correct species.

Before doing that we randomly choose 75 of the 150 rows of the data frame to be the training sample, with the other 75 being the test sample.

set.seed(321)
selected <- sample(1:150, replace = FALSE, size = 75)
iris.train <- iris[selected, ]
iris.test <- iris[-selected, ]

There are several packages which implement logistic regression for data with more than two classes. We will use the VGAM package. The function vglm within VGAM implements logistic regression (and much more).

library(VGAM)
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following object is masked _by_ '.GlobalEnv':
## 
##     logistic
## The following object is masked from 'package:tidyr':
## 
##     fill
iris.lr <- vglm(Species ~ Petal.Width + Petal.Length, data = iris.train, family = multinomial)
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 6 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 12 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 14 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 16 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12

## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 16 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 22 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 25 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 28 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 31 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 36 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 37 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 44 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 45 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 50 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in checkwz(wz, M = M, trace = trace, wzepsilon =
## control$wzepsilon): 54 diagonal elements of the working weights variable
## 'wz' have been replaced by 1.819e-12
## Warning in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
## convergence not obtained in 21 IRLS iterations
summary(iris.lr)
## 
## Call:
## vglm(formula = Species ~ Petal.Width + Petal.Length, family = multinomial, 
##     data = iris.train)
## 
## 
## Pearson residuals:
##                           Min         1Q    Median        3Q       Max
## log(mu[,1]/mu[,3]) -1.813e-06 -2.464e-07 4.360e-08 6.302e-08 1.644e-05
## log(mu[,2]/mu[,3]) -1.265e+00 -1.223e-04 6.589e-07 2.993e-03 2.471e+00
## 
## Coefficients: 
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept):1    118.947   9260.552      NA       NA
## (Intercept):2     63.510     34.004      NA       NA
## Petal.Width:1    -27.338  14731.810  -0.002    0.999
## Petal.Width:2    -14.554      7.466      NA       NA
## Petal.Length:1   -26.612   6453.201  -0.004    0.997
## Petal.Length:2    -8.242      5.591      NA       NA
## 
## Number of linear predictors:  2 
## 
## Names of linear predictors: log(mu[,1]/mu[,3]), log(mu[,2]/mu[,3])
## 
## Residual deviance: 8.6819 on 144 degrees of freedom
## 
## Log-likelihood: -4.341 on 144 degrees of freedom
## 
## Number of iterations: 21 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):1', '(Intercept):2', 'Petal.Width:2', 'Petal.Length:2'
## 
## Reference group is level  3  of the response

Notice that the family is specifed as multinomial rather than binomial since we have more than two classes. When run with these data, the vglm function returns several (about 20) warnings. These occur mainly because the classes are so easily separated, and are suppressed above.

Next we compute the probabilities for the test data.

iris.probs <- predict(iris.lr, iris.test[, c(3, 4)], type = "response")
head(iris.probs)
##    setosa   versicolor    virginica
## 4       1 1.003592e-11 1.129031e-32
## 5       1 1.598575e-12 7.887705e-34
## 9       1 1.598575e-12 7.887705e-34
## 11      1 1.003592e-11 1.129031e-32
## 12      1 6.300597e-11 1.616073e-31
## 14      1 1.799143e-15 1.747518e-38

For these, we want to choose the highest probability in each row.

which.max(c(2, 3, 1, 5, 8, 3))
## [1] 5
which.max(c(2, 20, 4, 5, 9, 1, 0))
## [1] 2
class.predictions <- apply(iris.probs, 1, which.max)
head(class.predictions)
##  4  5  9 11 12 14 
##  1  1  1  1  1  1
class.predictions[class.predictions == 1] <- levels(iris$Species)[1]
class.predictions[class.predictions == 2] <- levels(iris$Species)[2]
class.predictions[class.predictions == 3] <- levels(iris$Species)[3]
head(class.predictions)
##        4        5        9       11       12       14 
## "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

Next we create the confusion matrix.

table(class.predictions, iris.test$Species)
##                  
## class.predictions setosa versicolor virginica
##        setosa         26          0         0
##        versicolor      0         19         1
##        virginica       0          2        27

9.2 Nearest neighbor methods

Nearest neighbor methods provide a rather different way to construct classifiers, and have strengths (minimal assumptions, exible decision boundaries) and weaknesses (computational burden, lack of interpretability) compared to logistic regression models.

There are at least three R packages which implement kNN, including class, kknn, and RWeka. We will use class below.

u.knn <- "http://blue.for.msu.edu/FOR875/data/knnExample.csv"
knnExample <- read.csv(u.knn, header=TRUE)
str(knnExample)
## 'data.frame':    200 obs. of  3 variables:
##  $ x1   : num  2.5261 0.367 0.7682 0.6934 -0.0198 ...
##  $ x2   : num  0.3211 0.0315 0.7175 0.7772 0.8673 ...
##  $ class: int  0 0 0 0 0 0 0 0 0 0 ...
ggplot(data = knnExample, aes(x = x1, y = x2)) +
    geom_point(aes(color = as.factor(class))) +
    theme_bw()

expand.grid(x = c(1, 2), y = c(5, 3.4, 2))
##   x   y
## 1 1 5.0
## 2 2 5.0
## 3 1 3.4
## 4 2 3.4
## 5 1 2.0
## 6 2 2.0
min(knnExample$x1)
## [1] -2.52082
max(knnExample$x1)
## [1] 4.170746
min(knnExample$x2)
## [1] -1.999853
max(knnExample$x2)
## [1] 2.855805
x.test <- expand.grid(x1 = seq(-2.6, 4.2, by = 0.1), x2 = seq(-2, 2.9, by = 0.1))
library(class)
Example_knn <- knn(knnExample[, c(1, 2)], x.test, knnExample[,3], k = 15, prob = TRUE)
prob <- attr(Example_knn, "prob")
head(prob)
## [1] 0.6666667 0.6666667 0.6666667 0.7333333 0.7333333 0.7333333
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
knn(train, test, cl, k = 3, prob=TRUE)
##  [1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c c c c v c
## [36] c c c c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v v v
## [71] v v v v v
## attr(,"prob")
##  [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
## [29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
## [36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000 1.0000000
## [57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000 1.0000000
## [64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
## Levels: c s v
library(dplyr)
df1 <- mutate(x.test, prob = prob, class = 0, prob_cls = ifelse(Example_knn == class, 1, 0))
str(df1)
## 'data.frame':    3450 obs. of  5 variables:
##  $ x1      : num  -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
##  $ x2      : num  -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
##  $ prob    : num  0.667 0.667 0.667 0.733 0.733 ...
##  $ class   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prob_cls: num  1 1 1 1 1 1 1 1 1 1 ...
df2 <- mutate(x.test, prob = prob, class = 1, prob_cls = ifelse(Example_knn == class, 1, 0))
bigdf <- bind_rows(df1, df2)

names(knnExample)
## [1] "x1"    "x2"    "class"
ggplot(bigdf) + 
    geom_point(aes(x = x1, y = x2, col = class),data = mutate(x.test, class = Example_knn), size = 0.5) +
    geom_point(aes(x = x1, y = x2, col = as.factor(class)), size = 4, shape = 1, data = knnExample) + 
    geom_contour(aes(x = x1, y = x2, z = prob_cls, group = as.factor(class), color = as.factor(class)), size = 1, bins = 1, data = bigdf) + 
    theme_bw()

Example_knn <- knn(knnExample[, c(1, 2)], x.test, knnExample[,3], k = 1, prob = TRUE) 
prob <- attr(Example_knn, "prob")
head(prob)
## [1] 1 1 1 1 1 1
df1 <- mutate(x.test, prob = prob, class = 0, prob_cls = ifelse(Example_knn == class, 1, 0))
str(df1)
## 'data.frame':    3450 obs. of  5 variables:
##  $ x1      : num  -2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 ...
##  $ x2      : num  -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
##  $ prob    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ class   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ prob_cls: num  1 1 1 1 1 1 1 1 1 1 ...
df2 <- mutate(x.test, prob = prob, class = 1, prob_cls = ifelse(Example_knn == class, 1, 0))
bigdf <- bind_rows(df1, df2)
ggplot(bigdf) + geom_point(aes(x = x1, y = x2, col = class), data = mutate(x.test, class = Example_knn), size = 0.5) +
    geom_point(aes(x = x1, y = x2, col = as.factor(class)), size = 4, shape = 1, data = knnExample) + 
    geom_contour(aes(x = x1, y = x2, z = prob_cls, group = as.factor(class), color = as.factor(class)), size = 1, bins = 1, data = bigdf) + 
    theme_bw()

9.2.1 kNN and the diabetes data

Next kNN is applied to the diabetes data. We will use the same predictors, glu and bmi that were used in the logistic regression example. Since the scales of the predictor variables are substantially different, they are standardized first. The value k = 15 is chosen for kNN.

Pima.tr[, 1:7] <- scale(Pima.tr[, 1:7], center = TRUE, scale = TRUE)
Pima.te[, 1:7] <- scale(Pima.te[, 1:7], center = TRUE, scale = TRUE)
knn_Pima <- knn(Pima.tr[, c(2, 5)], Pima.te[, c(2, 5)], Pima.tr[, 8], k = 15, prob = TRUE)
table(knn_Pima, Pima.te[, 8])
##         
## knn_Pima  No Yes
##      No  206  55
##      Yes  17  54

9.2.2 kNN and the iris data

Now kNN is used to classify the iris data. As before we use petal length and width as predictors

sd(iris.train$Petal.Width)
## [1] 0.728585
sd(iris.train$Petal.Length)
## [1] 1.671873
head(iris.train)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 144          6.8         3.2          5.9         2.3  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 36           5.0         3.2          1.2         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 58           4.9         2.4          3.3         1.0 versicolor
## 50           5.0         3.3          1.4         0.2     setosa
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 1, prob = TRUE)
table(knn_iris, iris.test[, 5])
##             
## knn_iris     setosa versicolor virginica
##   setosa         26          0         0
##   versicolor      0         20         1
##   virginica       0          1        27
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 3, prob = TRUE)
table(knn_iris, iris.test[, 5])
##             
## knn_iris     setosa versicolor virginica
##   setosa         26          0         0
##   versicolor      0         19         1
##   virginica       0          2        27
knn_iris <- knn(iris.train[, c(3, 4)], iris.test[, c(3, 4)], iris.train[, 5], k = 15, prob= TRUE)
table(knn_iris, iris.test[, 5])
##             
## knn_iris     setosa versicolor virginica
##   setosa         26          0         0
##   versicolor      0         19         1
##   virginica       0          2        27

Text Data

Examples of text data sets

1. Enron email data set
2. Play by play sports data (S.Hauschka 37 yd Field Goal)
3. Unabomber Manifesto

Many applications require the ability to manipulate and process text data. For example, an email spam filter

10.1 Reading text data into R

Two functions to read in text data:

  1. scan function - read data in from a file (more option)
    scan(file, what, sep)
    what is the variable type
    sep is the deliminator (end of file line)
    The notation for sep argument are:
    " " for the space
    "\n" for new line
    "\t" for tab
  2. readLines function - to read url.
u.email <- "http://blue.for.msu.edu/FOR875/data/email1.txt"
email1 <- scan(u.email, what = "character", sep = "")
length(email1)
## [1] 133
email1[1:10]
##  [1] "From"                           "safety33o@l11.newnamedns.com"  
##  [3] "Fri"                            "Aug"                           
##  [5] "23"                             "11:03:37"                      
##  [7] "2002"                           "Return-Path:"                  
##  [9] "<safety33o@l11.newnamedns.com>" "Delivered-To:"
email1[1:10]
##  [1] "From"                           "safety33o@l11.newnamedns.com"  
##  [3] "Fri"                            "Aug"                           
##  [5] "23"                             "11:03:37"                      
##  [7] "2002"                           "Return-Path:"                  
##  [9] "<safety33o@l11.newnamedns.com>" "Delivered-To:"
email1 <- scan(u.email, what = "character", sep = "\n")
length(email1)
## [1] 26
email1[1:10]
##  [1] "From safety33o@l11.newnamedns.com  Fri Aug 23 11:03:37 2002"              
##  [2] "Return-Path: <safety33o@l11.newnamedns.com>"                              
##  [3] "Delivered-To: zzzz@localhost.example.com"                                 
##  [4] "Received: from localhost (localhost [127.0.0.1])"                         
##  [5] "\tby phobos.labs.example.com (Postfix) with ESMTP id 5AC994415F"          
##  [6] "\tfor <zzzz@localhost>; Fri, 23 Aug 2002 06:02:59 -0400 (EDT)"            
##  [7] "Received: from mail.webnote.net [193.120.211.219]"                        
##  [8] "\tby localhost with POP3 (fetchmail-5.9.0)"                               
##  [9] "\tfor zzzz@localhost (single-drop); Fri, 23 Aug 2002 11:02:59 +0100 (IST)"
## [10] "Received: from l11.newnamedns.com ([64.25.38.81])"

The scan function is quite flexible. In fact, read.table uses scan to actually read in the data. Read the help file for scan if more information is desired.

u.moby <- "http://blue.for.msu.edu/FOR875/data/mobydick.txt"
moby_dick <- scan(u.moby, what = "character", sep = "\n")
moby_dick[1:25]
##  [1] "The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman Melville"
##  [2] "This eBook is for the use of anyone anywhere at no cost and with"          
##  [3] "almost no restrictions whatsoever.  You may copy it, give it away or"      
##  [4] "re-use it under the terms of the Project Gutenberg License included"       
##  [5] "with this eBook or online at www.gutenberg.org"                            
##  [6] "Title: Moby Dick; or The Whale"                                            
##  [7] "Author: Herman Melville"                                                   
##  [8] "Last Updated: January 3, 2009"                                             
##  [9] "Posting Date: December 25, 2008 [EBook #2701]"                             
## [10] "Release Date: June, 2001"                                                  
## [11] "Language: English"                                                         
## [12] "*** START OF THIS PROJECT GUTENBERG EBOOK MOBY DICK; OR THE WHALE ***"     
## [13] "Produced by Daniel Lazarus and Jonesey"                                    
## [14] "MOBY DICK; OR THE WHALE"                                                   
## [15] "By Herman Melville"                                                        
## [16] "Original Transcriber's Notes:"                                             
## [17] "This text is a combination of etexts, one from the now-defunct ERIS"       
## [18] "project at Virginia Tech and one from Project Gutenberg's archives. The"   
## [19] "proofreaders of this version are indebted to The University of Adelaide"   
## [20] "Library for preserving the Virginia Tech version. The resulting etext"     
## [21] "was compared with a public domain hard copy version of the text."          
## [22] "In chapters 24, 89, and 90, we substituted a capital L for the symbol"     
## [23] "for the British pound, a unit of currency."                                
## [24] "ETYMOLOGY."                                                                
## [25] "(Supplied by a Late Consumptive Usher to a Grammar School)"

You will notice that scan function ignored blank lines in the file. If it is important to preserve blank lines, the argument blank.lines.skip = FALSE can be supplied to scan.

The file containing the novel contains some introductory and closing text that is not part of the original novel.

moby_dick <- moby_dick[408:18576]
length(moby_dick)
## [1] 18169

10.2 The paste function

The paste function concatenates vectors after (if necessary) converting the vectors to character.

paste("Homer Simpson", "is", "Bart Simpsons", "father")
## [1] "Homer Simpson is Bart Simpsons father"
n <- 10
paste("The value of n is", n)
## [1] "The value of n is 10"
paste(c("pig", "dog"), 3)
## [1] "pig 3" "dog 3"

By default the paste function separates the input vectors with a space. But other separators can be specified

paste("mail", "google", "com", sep = ".")
## [1] "mail.google.com"
paste(c("dog", "cat", "horse", "human", "elephant"), "food")
## [1] "dog food"      "cat food"      "horse food"    "human food"   
## [5] "elephant food"
paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"))
## [1] "one six"     "two seven"   "three eight" "four nine"   "five ten"

To create one character element, we use collapse

paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"), collapse = ".")
## [1] "one six.two seven.three eight.four nine.five ten"
paste(c("one", "two", "three", "four", "five"), c("six", "seven", "eight", "nine", "ten"), collapse = " ")
## [1] "one six two seven three eight four nine five ten"
paste(c("a", "b"), 1:10, sep = "")
##  [1] "a1"  "b2"  "a3"  "b4"  "a5"  "b6"  "a7"  "b8"  "a9"  "b10"